
570 points davidgu | 5 comments
leontrolski ◴[] No.44525560[source]
I'd be interested in how dumb-ol' polling would compare here (the FOR UPDATE SKIP LOCKED method: https://leontrolski.github.io/postgres-as-queue.html). One day I'll set up some benchmarks, as this is the kind of thing people argue about a lot without much evidence either way.
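For reference, the FOR UPDATE SKIP LOCKED polling approach boils down to a single claim query. This is a generic sketch (table and column names are illustrative, not taken from the linked post):

```sql
-- Claim one pending job, skipping rows already locked by other workers,
-- so concurrent pollers never block on each other:
UPDATE jobs
SET status = 'running', claimed_at = now()
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
```

Each worker runs this in its own transaction; the subquery locks the row it selects, so two workers can never claim the same job.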

Wasn't aware of this AccessExclusiveLock behaviour - a reminder (and shameless plug 2) of how Postgres locks interact: https://leontrolski.github.io/pglockpy.html

replies(9): >>44525593 #>>44525651 #>>44525828 #>>44525857 #>>44527315 #>>44527425 #>>44527778 #>>44528689 #>>44533402 #
singron ◴[] No.44525857[source]
Polling is the way to go, but it's also very tricky to get right. In particular, it's non-trivial to make a reliable queue that's also fast when transactions are held open and vacuum isn't able to clean tuples. E.g. "get the first available tuple" might have to skip over 1000s of dead tuples.

Holding transactions open is an anti-pattern for sure, but it's occasionally useful. E.g. pg_repack keeps a transaction open while it runs, and I believe vacuum also holds an open transaction for part of its run. It's also nice if your database doesn't melt whenever this happens by accident.

replies(3): >>44527188 #>>44528430 #>>44530342 #
1. time0ut ◴[] No.44527188[source]
An approach that has worked for me is to hash partition the table and have each worker look for work in one partition at a time. There are a number of strategies depending on how you manage workers. This allows you to only consider 1/Nth of the dead tuples, where N is the number of partitions, when looking for work. It does come at the cost of strict ordering, but there are many use cases where strict ordering is not required. The largest scale implementation of this strategy that I have done had 128 partitions with a worker per partition pumping through ~100 million tasks per day.
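A minimal sketch of the hash-partitioning idea, assuming a Postgres declaratively partitioned table (names and the ID-generation strategy are hypothetical):

```sql
-- Hash-partition the queue so each worker scans only its own
-- partition's tuples (dead or live), i.e. ~1/Nth of the bloat:
CREATE TABLE jobs (
    id      bigint NOT NULL,  -- generated by the producer or a sequence
    status  text   NOT NULL DEFAULT 'pending',
    payload jsonb
) PARTITION BY HASH (id);

CREATE TABLE jobs_p0 PARTITION OF jobs
    FOR VALUES WITH (MODULUS 128, REMAINDER 0);
-- ... jobs_p1 through jobs_p127 likewise.

-- A worker assigned partition 0 polls that partition directly,
-- never touching the other 127 partitions:
SELECT id, payload
FROM jobs_p0
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;
```

Ordering is only per-partition here, which matches the trade-off described above: you give up strict global ordering for much cheaper scans.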

I also found LISTEN/NOTIFY to not work well at this scale, and used a polling-based approach with a back-off when no work was found.

Quite an interesting problem and a bit challenging to get right at scale.

replies(3): >>44527269 #>>44527477 #>>44527797 #
2. dfsegoat ◴[] No.44527269[source]
If there were a toy or other public implementation of this, I would love to see it.
3. j16sdiz ◴[] No.44527477[source]
You can't change the number of partitions dynamically.

It's an additional challenge if jobs come in uneven sizes.

replies(1): >>44528283 #
4. CBLT ◴[] No.44527797[source]
This is how Kafka does it. Kafka has spent years working on the rough edges (e.g. partition resizing), though I haven't used it recently.
5. AlisdairO ◴[] No.44528283[source]
Depending on exactly what you need, you can often fake this with a functional index on mod(queue_value_id, 5000). You then query for mod(queue_value_id, 5000) between m and n, and can dynamically adjust the gap between m and n based on how many partitions you want.
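A sketch of this "virtual partitions" variant, keeping the commenter's mod(queue_value_id, 5000) expression; the table name and partial-index predicate are assumptions for illustration:

```sql
-- Expression index on the mod bucket; the partial predicate keeps it
-- small by covering only rows still waiting to be claimed:
CREATE INDEX jobs_bucket_idx
    ON jobs (mod(queue_value_id, 5000))
    WHERE status = 'pending';

-- A worker responsible for buckets m..n (here 100..199) claims work
-- only from its slice; widen or narrow the range to rebalance workers
-- without any physical repartitioning:
SELECT queue_value_id, payload
FROM jobs
WHERE mod(queue_value_id, 5000) BETWEEN 100 AND 199
  AND status = 'pending'
LIMIT 1
FOR UPDATE SKIP LOCKED;
```

Unlike real hash partitions, the bucket count is fixed at index-creation time, but the assignment of bucket ranges to workers can change at any moment, which is the dynamic adjustment described above.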