
570 points davidgu | 4 comments
1. ilitirit No.44529161
> The structured data gets written to our Postgres database by tens of thousands of simultaneous writers. Each of these writers is a “meeting bot”, which joins a video call and captures the data in real-time.

Maybe I missed it in some folded-up embedded content or a graph (or maybe I'm just blind...), but is it mentioned at which point they started running into issues? The quoted bit about "tens of thousands of simultaneous writers" is all I can find.

What is the qualitative and quantitative nature of relevant workloads? Depending on the answers, some people may not care.

I asked ChatGPT to research it and this is the executive summary:

  For PostgreSQL’s LISTEN/NOTIFY, a realistic safe throughput is:

  Up to ~100–500 notifications/sec: Handles well on most systems with minimal tuning. Low risk of contention.

  ~500–2,000 notifications/sec: Reasonable with good tuning (short transactions, fast listeners, few concurrent writers). May start to see lock contention.

  ~2,000–5,000 notifications/sec: Pushing the upper bounds. Requires careful batching, dedicated listeners, possibly separate Postgres instances for pub/sub.

  >5,000 notifications/sec: Not recommended for sustained load. You’ll likely hit serialization bottlenecks due to the global commit lock held during NOTIFY.
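FWIW, those numbers are easy enough to sanity-check rather than take on faith. Here's a rough, hypothetical harness (Python + psycopg2; the DSN and channel name are placeholders I made up) that measures the simplest serialized path: a single writer issuing one NOTIFY per autocommit transaction. It deliberately does not reproduce the many-concurrent-writers contention the article is about.

  import time
  import psycopg2

  N = 5000
  DSN = "dbname=test"  # placeholder connection string

  # Listener connection; autocommit so LISTEN takes effect immediately.
  listener = psycopg2.connect(DSN)
  listener.autocommit = True
  with listener.cursor() as cur:
      cur.execute("LISTEN bench")

  # Writer connection; autocommit means one commit (and one WAL flush) per
  # notification, i.e. the worst-case unbatched path.
  sender = psycopg2.connect(DSN)
  sender.autocommit = True

  start = time.time()
  with sender.cursor() as cur:
      for i in range(N):
          cur.execute("SELECT pg_notify('bench', %s)", (str(i),))
  elapsed = time.time() - start
  print(f"sent {N} notifies in {elapsed:.2f}s (~{N / elapsed:.0f}/sec)")

  # Drain whatever has been delivered to the listener so far.
  listener.poll()
  print(f"listener has seen {len(listener.notifies)} notifications")

A single unbatched writer mostly measures commit/flush latency, so treat the result as a rough floor, not a verdict on the tens-of-thousands-of-writers case.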
replies(1): >>44529202 #
2. ilitirit No.44529839
What is wrong with you? Why would you even bother posting a comment like this?

Maybe you also don't know what ChatGPT Research is (the Enterprise version, if you really need to know), or what an Executive Summary implies, but here's a snippet of the 28 sources it used:

https://imgur.com/a/eMdkjAh

replies(1): >>44530622 #
3. ants_a No.44530622{3}
In that snippet are links to Postgres docs and two blog posts, one being the blog post under discussion. None of those contain the information needed to make the presented claims about throughput.

To make those claims it's necessary to know what work is being done while the lock is held. This includes various resource cleanup, which should be cheap, and RecordTransactionCommit(), which grabs a lock to insert a WAL record, waits for it to get flushed to disk, and potentially also waits for it to get acknowledged by a synchronous replica. So the expected throughput is somewhere between hundreds and tens of thousands of notifies per second. But as far as I can tell, this conclusion is only available from the PostgreSQL source code and some assumptions about typical storage and network performance.
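To put a rough number on that reasoning (the latencies below are assumptions for illustration, not measurements):

  # Back-of-envelope: if every NOTIFY-carrying commit serializes behind a
  # WAL flush, throughput is roughly bounded by flush latency and by how
  # many notifications you batch into each transaction.
  for flush_latency_s in (0.0005, 0.002, 0.01):   # assumed: fast local fsync, slower disk, sync replica round trip
      for notifies_per_txn in (1, 10):
          rate = notifies_per_txn / flush_latency_s
          print(f"flush {flush_latency_s * 1000:.1f} ms, batch {notifies_per_txn}: ~{rate:,.0f} notifies/sec")

That spans roughly 100 to 20,000 notifies per second, which is consistent with "hundreds to tens of thousands" depending on storage, replication, and batching.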

replies(1): >>44531705 #
4. ilitirit No.44531705{4}
> In that snippet are links to Postgres docs and two blog posts

Yes, that's what a snippet generally is. The generated document from my very basic research prompt is over 300k in length. There are also sources from the official mailing lists, graphile, and various community discussions.

I'm not going to post the entire output because it is completely beside the point. In my original post, I explicitly asked "What is the qualitative and quantitative nature of relevant workloads?" exactly because it's not clear from the blog post. If, for example, they only started hitting these issues at 10k simultaneous reads/writes, then it's reasonable to assume that many people who don't have such high workloads won't really care.

The ChatGPT snippet was included to show what ChatGPT Research told me, nothing more. I basically typed a two-line prompt and asked it to include the original article. Anyone who thinks that what I posted is authoritative in any way shouldn't be considering doing this type of work.