hombre_fatal:
Interesting. What if you just execute `NOTIFY` in its own connection outside of / after the transaction?
parthdesai:
You lose transactional guarantees if you notify outside of the transaction, though.
hombre_fatal:
Yeah, but pub/sub systems already need to be robust to missed messages. And sending the notify after the transaction succeeds usually accomplishes everything you really care about (no false positives).
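
A minimal sketch of that ordering (assuming psycopg2; the table and channel names here are made up):

    # Sketch: commit the transaction first, then NOTIFY on a separate
    # autocommit connection. Table/channel names are hypothetical.
    import psycopg2

    tx_conn = psycopg2.connect("dbname=app")      # transactional work
    notify_conn = psycopg2.connect("dbname=app")  # fire-and-forget signal
    notify_conn.autocommit = True                 # NOTIFY takes effect at once

    with tx_conn:  # commits on success, rolls back on exception
        with tx_conn.cursor() as cur:
            cur.execute("INSERT INTO jobs (payload) VALUES (%s)", ("work",))

    # Only reached after a successful commit: no false positives. A crash
    # right here means a missed wake-up, which the subscriber must tolerate.
    with notify_conn.cursor() as cur:
        cur.execute("SELECT pg_notify(%s, %s)", ("jobs_channel", "new work"))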
parthdesai:
What happens when the transaction succeeds but the NOTIFY, issued afterwards on its own separate connection, fails?
saltcured:
For reliability, you can make the recipient poll the table(s) of record for relevant state and use the out-of-band notification channel as a latency-reducer. So, the poller is eventually consistent at some configured polling interval, but opportunistically can respond much sooner when told to check again ahead of the next scheduled poll time.

In my experience, this means you make sure the polling solution is complete and correct, and the notifier gets reduced to a wake-up signal. This signal doesn't even need to carry the actionable change content, if the poller can already pose efficient queries for whatever "new stuff" it needs.

This approach also allows the poller to keep its own persistent cursor state if there is some stateful sequence to how it consumes the DB content. It automatically resynchronizes and the notification channel does not need to be kept in lock-step with the consumption.
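
A sketch of that shape (assuming psycopg2; the channel, table, and cursor scheme are all illustrative):

    # Sketch: the poller is complete on its own; LISTEN only reduces latency.
    import select
    import psycopg2

    POLL_INTERVAL = 5.0  # seconds; the eventual-consistency bound

    conn = psycopg2.connect("dbname=app")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("LISTEN jobs_channel")

    last_seen = 0  # persistent cursor state; real code would store this durably

    def poll_table():
        global last_seen
        with conn.cursor() as cur:
            # Hypothetical schema: the poller poses its own "new stuff" query.
            cur.execute("SELECT id, payload FROM jobs WHERE id > %s ORDER BY id",
                        (last_seen,))
            for job_id, payload in cur.fetchall():
                print("processing", payload)  # stand-in for real consumption
                last_seen = job_id

    while True:
        poll_table()  # runs at least once per POLL_INTERVAL regardless
        # Sleep until the next scheduled poll, or wake early on a NOTIFY.
        if select.select([conn], [], [], POLL_INTERVAL)[0]:
            conn.poll()
            conn.notifies.clear()  # payload unused: it is only a wake-up signal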

valenterry:
> you can make the recipient poll the table(s) of record for relevant state

That is tricky due to transactions and visibility. How do you write the poller so it doesn't miss events that were written by a long/blocked transaction? You'd have to set the poller's scan window to a long time (e.g. "process events that were written since now minus 5 minutes") and then make sure transactions are cancelled hard before those 5 minutes elapse.

saltcured:
I'd say the most reliable way is to use some mutable lifecycle metadata other than timestamps to identify work. An indexed query will find the "new and unclaimed" work items and process them, regardless of their potentially backdated temporal metadata.

Updates of the lifecycle properties can also help coordinate multiple pollers so that they never work on the same item, but they can have overlapping query terms so that each poller is capable of picking up a particular item in the absence of others getting there first.

You also need some kind of lease/timeout policy to recognize orphaned items, i.e. claimed in the DB but not making progress. Workers can and should have exception handling and compensating updates to report failures and put items "back in the queue", but in the worst case this update never arrives. Some process, or even some human operator, needs to eventually compensate on behalf of the AWOL worker.
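
Concretely, both pieces might look like this sketch (the table, column, and state names are hypothetical; FOR UPDATE SKIP LOCKED is one way to keep pollers off each other's rows):

    # Sketch: claim work by lifecycle state, not by timestamp. All names
    # are hypothetical.
    CLAIM_SQL = """
        UPDATE jobs SET state = 'claimed', claimed_by = %s, claimed_at = now()
        WHERE id = (
            SELECT id FROM jobs
            WHERE state = 'pending'       -- indexed "new and unclaimed" query
            ORDER BY id
            LIMIT 1
            FOR UPDATE SKIP LOCKED        -- concurrent pollers skip claimed rows
        )
        RETURNING id, payload
    """

    # Lease/timeout policy: anything claimed too long ago is presumed
    # orphaned and put back in the queue (a human operator could run this too).
    REQUEUE_ORPHANS_SQL = """
        UPDATE jobs SET state = 'pending', claimed_by = NULL
        WHERE state = 'claimed' AND claimed_at < now() - interval '5 minutes'
    """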

In my view, you always need this kind of table-scanning logic, even if using something like AMQP for work dispatch. You get in trouble when you fool yourself into imagining that "exactly once" semantics actually exist. The message-passing layer can opportunistically scale out the workload, but a relational backstop can make sure that the real system of record is coherent and reflects the business goals. Sometimes you can just run this relational layer as the main work scheduler and skip the whole message-passing build-out.

valenterry:
The problem is that you now have to poll based on an index (maybe BRIN isn't too bad, though), and you have to overwrite the row afterwards and update the index. That means you are creating a dead tuple for every row (and one more if you mark it as "completed").
saltcured:
Yes, everything is tradeoffs.

When trying to make good use of RDBMS transactional semantics, I think an important mental shift is to think of there being multiple async processing domains rather than a single magical transaction space. DB transactions are just communication events, not actual business work. This is how the relational DB can become the message broker.

The agents need to do something akin to a 2-phase commit protocol to record their "intent" and their "result" across different business resources. But for a failure-prone, web-style network of agents, I would not expose actual DB 2-phase commit protocols. Instead, the relational model reifies the 2-phase-like state ambiguity of particular business resources as tuples, and the agents communicate important phases of their work process with simpler state-update transactions.

It's basically the same pattern as with safe use of AMQP, just replacing one queue primitive with another. Both approaches require delayed-acknowledgement patterns, so tasks can be routed to an agent but not removed from the system until after the agent reports the work complete. Either approach has a lost-or-orphaned-task hazard if naively written to dequeue tasks earlier in the work process. An advantage of the RDBMS-based message broker is that you can also use SQL to supervise all the lifecycle state, or even intervene to clean up after agent failures.
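
The delayed-acknowledgement shape, sketched against the same hypothetical jobs table (reusing CLAIM_SQL from the earlier sketch; do_business_work stands in for the real task):

    # Sketch: delayed acknowledgement. The row leaves the queue only after
    # the agent reports a result; failure triggers a compensating update.
    def run_one(conn, worker_id):
        with conn, conn.cursor() as cur:   # claim in its own transaction
            cur.execute(CLAIM_SQL, (worker_id,))
            row = cur.fetchone()
        if row is None:
            return  # nothing pending
        job_id, payload = row
        try:
            do_business_work(payload)      # the actual work, outside the DB
            finish_sql = "UPDATE jobs SET state = 'done' WHERE id = %s"
        except Exception:
            # Compensating update: report failure, put the item back.
            finish_sql = ("UPDATE jobs SET state = 'pending', "
                          "claimed_by = NULL WHERE id = %s")
        with conn, conn.cursor() as cur:   # record the result
            cur.execute(finish_sql, (job_id,))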

In this approach, don't scale up a central RDBMS by disabling all its useful features in a mad dash for speed. Instead, think of the network of async agents (human or machine) and RDBMS message broker(s) scaled for their respective traffic. This agent network and communication workload can often be partitioned to reach scaling goals. E.g. specific business resources might go into different "home" zones with distinct queues and agent pools. Their different lifecycle states do not need to exist under a single, common transaction control.