Most active commenters

atombender(4)
singron(3)

Popular/hot comments

>>44525857 #
>>44527188 #

←back to thread

Postgres LISTEN/NOTIFY does not scale

(www.recall.ai)

1. leontrolski ◴[10 Jul 25 21:00 UTC] No.44525560[source]▶

>>44490510 (OP) #

I'd be interested as to how dumb-ol' polling would compare here (the FOR UPDATE SKIP LOCKED method https://leontrolski.github.io/postgres-as-queue.html). One day I will set up some benchmarks as this is the kind of thing people argue about a lot without much evidence either way.

Wasn't aware of this AccessExclusiveLock behaviour - a reminder (and shameless plug 2) of how Postgres locks interact: https://leontrolski.github.io/pglockpy.html

replies(9): >>44525593 #>>44525651 #>>44525828 #>>44525857 #>>44527315 #>>44527425 #>>44527778 #>>44528689 #>>44533402 #

2. aurumque ◴[10 Jul 25 21:03 UTC] No.44525593[source]▶

>>44525560 (TP) #

I'll take the shameless plug. Thank you for putting this together! Very helpful overview of pg locks.

replies(1): >>44535273 #

3. cpursley ◴[10 Jul 25 21:10 UTC] No.44525651[source]▶

>>44525560 (TP) #

Have you played with pgmq? It's pretty neat: https://github.com/pgmq/pgmq

replies(1): >>44526409 #

4. RedShift1 ◴[10 Jul 25 21:29 UTC] No.44525828[source]▶

>>44525560 (TP) #

I use polling with back off up to one minute. So when a workload is done, it immediately polls for more work. If nothing found, wait for 5 seconds, still nothing 10 seconds, ... until one minute and from then on it polls every minute until it finds work again and the back off timer resets to 0 again.

5. singron ◴[10 Jul 25 21:33 UTC] No.44525857[source]▶

>>44525560 (TP) #

Polling is the way to go, but it's also very tricky to get right. In particular, it's non-trivial to make a reliable queue that's also fast when transactions are held open and vacuum isn't able to clean tuples. E.g. "get the first available tuple" might have to skip over 1000s of dead tuples.

Holding transactions open is an anti-pattern for sure, but it's occasionally useful. E.g. pg_repack keeps a transaction open while it runs, and I believe vacuum also holds an open transaction part of the time too. It's also nice if your database doesn't melt whenever this happens on accident.

replies(3): >>44527188 #>>44528430 #>>44530342 #

6. edoceo ◴[10 Jul 25 22:29 UTC] No.44526409[source]▶

>>44525651 #

Another thing for @leontrolski to add to the benchmarks - which I cannot wait to read.

replies(1): >>44527164 #

7. cpursley ◴[11 Jul 25 00:19 UTC] No.44527164{3}[source]▶

>>44526409 #

There's a pretty cool solution built on pgmq called pgflow:

https://www.pgflow.dev/concepts/how-pgflow-works

8. time0ut ◴[11 Jul 25 00:22 UTC] No.44527188[source]▶

>>44525857 #

An approach that has worked for me is to hash partition the table and have each worker look for work in one partition at a time. There are a number of strategies depending on how you manage workers. This allows you to only consider 1/Nth of the dead tuples, where N is the number of partitions, when looking for work. It does come at the cost of strict ordering, but there are many use cases where strict ordering is not required. The largest scale implementation of this strategy that I have done had 128 partitions with a worker per partition pumping through ~100 million tasks per day.

I also found LISTEN/NOTIFY to not work well at this scale and used a polling based approach with a back off when no work was found.

Quite an interesting problem and a bit challenging to get right at scale.

replies(3): >>44527269 #>>44527477 #>>44527797 #

9. dfsegoat ◴[11 Jul 25 00:36 UTC] No.44527269{3}[source]▶

>>44527188 #

If there were a toy or other public implementation of this, I would love to see it.

10. TkTech ◴[11 Jul 25 00:42 UTC] No.44527315[source]▶

>>44525560 (TP) #

With that experience behind you, would you have feedback for Chancy[1]? It aims to be a batteries-included offering for postgres+python, aiming for hundreds of millions of jobs a day, not massive horizontal worker scaling.

It both polls (configurable per queue) and supports listen/notify simply to inform workers that it can wake up early to trigger polling, and this can be turned off globally with a notifications=false flag.

[1]: https://github.com/tktech/chancy

11. qianli_cs ◴[11 Jul 25 01:03 UTC] No.44527425[source]▶

>>44525560 (TP) #

My colleague did some internal benchmarking and found that LISTEN/NOTIFY performs well under low to moderate load, but doesn't scale well with a large number of listeners. Our findings were pretty consistent with this blog post.

(Shameless plug [1]) I'm working on DBOS, where we implemented durable workflows and queues on top of Postgres. For queues, we use FOR UPDATE SKIP LOCKED for task dispatch, combined with exponential backoff and jitter to reduce contention under high load when many workers are polling the same table.

Would love to hear feedback from you and others building similar systems.

[1] https://github.com/dbos-inc/dbos-transact-py

replies(2): >>44527862 #>>44530722 #

12. j16sdiz ◴[11 Jul 25 01:13 UTC] No.44527477{3}[source]▶

>>44527188 #

Can't change the number of partition dynamically.

Additional challenge if jobs comes in funny sizes

replies(1): >>44528283 #

13. sorentwo ◴[11 Jul 25 02:06 UTC] No.44527778[source]▶

>>44525560 (TP) #

Ping requires something persistent to check. That requires creating tuples, and most likely deleting them after they’ve been consumed. That puts pressure on the database and requires vacuuming in ways that pubsub doesn’t because it’s entirely ephemeral.

Not to mention that pubsub allows multiple consumers for a single message, whereas FOR UPDATE is single consumer by design.

14. CBLT ◴[11 Jul 25 02:10 UTC] No.44527797{3}[source]▶

>>44527188 #

This is how Kafka does it. Kafka has spent years working on the rough edges (e.g. partition resizing), haven't used it recently though.

15. mind-blight ◴[11 Jul 25 02:22 UTC] No.44527862[source]▶

>>44527425 #

Nice! I'm using DBOS and am a little active on the discord. I was just wondering how y'all handled this under the hood. Glad to hear I don't have to worry much about this issue

16. AlisdairO ◴[11 Jul 25 04:01 UTC] No.44528283{4}[source]▶

>>44527477 #

Depending on exactly what you need, you can often fake this with a functional index on mod(queue_value_id, 5000). You then query for mod(queue_value_id,5000) between m and n. You can then dynamically adjust the gap between m and n based on how many partitions you want

17. leontrolski ◴[11 Jul 25 04:36 UTC] No.44528430[source]▶

>>44525857 #

> also fast when transactions are held open

In my linked example, on getting the item from the queue, you immediately set the status to something that you're not polling for - does Postgres still have to skip past these tuples (even in an index) until they're vacuumed up?

18. broken_broken_ ◴[11 Jul 25 05:33 UTC] No.44528689[source]▶

>>44525560 (TP) #

I have implemented polling against a cluster of mixed mariadb/mysql databases which do not offer listen/notify. It was a pain in the neck to get right.

- The batch size needs to be adaptative for performance, latency, and recovering smoothly after downtime.

- The polling timeouts, frequency etc the same.

- You need to avoid hysteresis.

- You want to be super careful about not disturbing the main application by placing heavy load on the database or accidentally locking tables/rows

- You likely want multiple distributed workers in case of a network partition to keep handling events

It’s hard to get right especially when the databases at the time did not support SKIP LOCKED.

In retrospect I wish I had listened to the WAL. Much easier.

19. atombender ◴[11 Jul 25 09:59 UTC] No.44530342[source]▶

>>44525857 #

Dead tuples is a real and significant problem, not just because it has to skip the tuples, but because the statistics that drive the planner don't account for them.

I found this out the hard way when I had a simple query that suddenly got very, very slow on a table where the application would constantly do a `SELECT ... FOR UPDATE SKIP LOCKED` and then immediately delete the rows after a tiny bit of processing.

It turned out that with a nearly empty table of about 10-20k dead tuples, the planner switched to using a different index scan, and would overfetch tons of pages just to discard them, as they only contained dead tuples. What I didn't realize is that the planner statistics doesn't care about dead tuples, and ANALYZE doesn't take them into account. So the planner started to think the table was much bigger than it actually was.

It's really important for these uses cases to tweak the autovacuum settings (which can be set on a per-table basis) to be much more aggressive, so that under high load, the vacuum runs pretty much continuously.

Another option is to avoid deleting rows, but instead use a column to mark rows as complete, which together with a partial index can avoid dead tuples. There are both pros and cons; it requires doing the cleanup (and VACUUM) as a separate job.

replies(1): >>44535859 #

20. eatonphil ◴[11 Jul 25 10:57 UTC] No.44530722[source]▶

>>44527425 #

Why not read the WAL?

replies(1): >>44534419 #

21. cryptonector ◴[11 Jul 25 15:38 UTC] No.44533402[source]▶

>>44525560 (TP) #

Instead of LISTEN/NOTIFY you could listen to the wal / logical replication stream.

Or you could have a worker whose only job is to listen to the wal / logical replication stream and then NOTIFY. Being the only one to do so would not burden other transactions.

Or you could have a worker whose only job is to listen to the wal / logical replication stream and then publish on some non-PG pubsub system.

22. qianli_cs ◴[11 Jul 25 16:54 UTC] No.44534419{3}[source]▶

>>44530722 #

We considered using WAL for change tracking in DBOS, but it requires careful setup and maintenance of replication slots, which may lead to unbounded disk growth if misconfigured. Since DBOS is designed to bolt onto users' existing Postgres instances (we don't manage their data), we chose a simpler, less intrusive approach that doesn't require a replication setup.

Plus, for queues, it's so much easier to leverage database constraints and transactions to implement global concurrency limit, rate limit, and deduplication.

23. notarobot123 ◴[11 Jul 25 18:08 UTC] No.44535273[source]▶

>>44525593 #

It's funny how "shameless plug" actually means "excuse the self-promotion" and implies at least a little bit of shame even when the reference is appropriate and on-topic.

24. singron ◴[11 Jul 25 19:10 UTC] No.44535859{3}[source]▶

>>44530342 #

Unfortunately, updating the row also creates dead tuples. It's very tricky!

replies(1): >>44535896 #

25. atombender ◴[11 Jul 25 19:14 UTC] No.44535896{4}[source]▶

>>44535859 #

It does, but because of how indexes work, I believe it won't be skewed by the presence of dead tuples (though the bloat can cause the live dat to be spread across a lot more blocks and therefore generate more I/O) as long as you run autoanalyze semi-regularly.

replies(1): >>44536811 #

26. singron ◴[11 Jul 25 21:10 UTC] No.44536811{5}[source]▶

>>44535896 #

It depends on if you are getting Heap Only Tuples (HOT) updates or not. https://www.postgresql.org/docs/current/storage-hot.html

In this case, you might have enough dead tuples across your heap that you might get a lot of HOT updates. If you are processing in insertion order, you will also probably process in heap order, and you can actually get 0 HOT updates since the other tuples in the page aren't fully dead yet. You could try using a lower fillfactor to avoid this, but that's also bad for performance so it might not help.

replies(1): >>44537020 #

27. atombender ◴[11 Jul 25 21:39 UTC] No.44537020{6}[source]▶

>>44536811 #

If you have a "done" column that you filter on using a partial index, then it would never use HOT updates anyway, since HOT requires that none of the modified columns have an index.

replies(1): >>44537423 #

28. menthe ◴[11 Jul 25 22:30 UTC] No.44537423{7}[source]▶

>>44537020 #

False.

As of PG16, HOT updates are tolerated against summarizing indexes, such as BRIN.

https://www.postgresql.org/docs/16/storage-hot.html

Besides, you probably don't want "done" jobs in the same table as pending or retriable jobs - as you scale up, you likely want to archive them as it provides various operational advantages, at no cost.

replies(1): >>44537457 #

29. atombender ◴[11 Jul 25 22:34 UTC] No.44537457{8}[source]▶

>>44537423 #

Not false. Nobody would ever use BRIN for this. I'm talking about regular indexes, which do prevent HOT.

If you read my earlier comment properly, you'll notice a "done" column is to avoid deleting columns on the hot path and avoid dead tuples messing up the planner. I agree that a table should not contain done jobs, but then you risk running into the dead tuple problem. Both approaches are a compromise.

↑