
570 points by davidgu | 2 comments
osigurdson ◴[] No.44527817[source]
I like this article. Lots of comments are stating that they are "using it wrong" and I'm sure they are. However, it does help to contrast with the much more common "use Postgres for everything" sentiment. It is pretty hard to use Postgres wrong for relational workloads, in the sense that everyone knows about indexes and so on. But using something like LISTEN/NOTIFY comes with its own learning curve - evidenced in this case by someone having to read comments in the Postgres source code itself. And if it turns out it can't work for your situation, it may be very hard to back away from, since you may have tightly integrated it with your normal Postgres stuff.

I've landed on Postgres/ClickHouse/NATS since together they handle nearly any conceivable workload - relational, columnar, messaging/streaming - very well. The stack is also not painful to run: it's lightweight and fast/easy to spin up in a simple docker compose. Postgres is of course the core, and you don't always need all three, but they complement each other very well imo. This has been my "go to" for a while.
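For concreteness, a minimal docker compose sketch of that kind of stack (image tags, ports, and credentials here are placeholder assumptions, not a recommendation):

    # docker-compose.yml - local dev stack; versions and ports are illustrative
    services:
      postgres:
        image: postgres:16
        environment:
          POSTGRES_PASSWORD: dev-only-password
        ports:
          - "5432:5432"
      clickhouse:
        image: clickhouse/clickhouse-server:24.8
        ports:
          - "8123:8123"   # HTTP interface
          - "9000:9000"   # native protocol
      nats:
        image: nats:2.10
        command: ["-js"]  # enable JetStream for streaming/persistence
        ports:
          - "4222:4222"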

replies(12): >>44528211 #>>44528216 #>>44529511 #>>44529632 #>>44529640 #>>44529854 #>>44530773 #>>44531235 #>>44531722 #>>44532418 #>>44532993 #>>44534858 #
ownagefool ◴[] No.44529511[source]
Largely agree. Functionality-wise, if you don't have many jobs, using the database as the queue is fine.

However, I've been in several situations where scaling the queue brings down the database, and therefore the app, and am thus of the opinion you probably shouldn't couple these systems too tightly.

There are pros and cons, of course.

replies(1): >>44529699 #
mike_hearn ◴[] No.44529699[source]
Using the database for queues is more than fine - it's often essential to correctness. In many use cases for queues you need to update the database atomically with respect to popping from the queue, and if they're separate systems you end up needing either XA or brittle, unreliable custom idempotency logic. I've seen this go wrong before and it's not nice; the common outcome is business-visible data corruption that can have financial impact.
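To make that concrete, here's a minimal sketch (my own illustration, not anyone's production code) of claiming a job and doing the related business write in one Postgres transaction; the jobs/accounts tables and a jsonb payload column are assumptions:

    # Sketch only: table and column names (jobs, accounts, payload) are made up.
    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN

    with conn:  # psycopg2: commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Claim one pending job; SKIP LOCKED lets concurrent workers
            # pick different rows instead of blocking on each other.
            cur.execute("""
                SELECT id, payload FROM jobs
                WHERE status = 'pending'
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED
            """)
            row = cur.fetchone()
            if row is not None:
                job_id, payload = row
                # The business write happens in the same transaction as the
                # dequeue, so both commit or both roll back - no XA and no
                # hand-rolled idempotency keys needed.
                cur.execute(
                    "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (payload["amount"], payload["account_id"]),
                )
                cur.execute(
                    "UPDATE jobs SET status = 'done' WHERE id = %s",
                    (job_id,),
                )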

This seems like another case where Postgres gets free marketing because companies hit its technical limits. I get why they choose to make lemonade out of it with an eng blog post, but this is a way too common pattern on HN. Some startup builds on Postgres, then spends half their eng budget at the most critical growth time firefighting around its limits instead of scaling their business. OpenAI had a similar blog post a couple of months ago, where they revealed they were probably spending more than a quarter of a million dollars a month on an Azure managed Postgres, and that it had stopped scaling so they were slowly abandoning it; I made the same comment there [1].

Postgres is a great DB for what you pay, but IMHO well-capitalized blitzscaling startups shouldn't be using it. If you buy a database - and realistically most Postgres users do anyway, as they're paying for a cloud-managed db - then you might as well just buy a commercial DB with an integrated queue engine. I have a financial COI because I have a part-time job at Oracle in the research division (on non-DB stuff), so keep that in mind, but they should just migrate to an Oracle database. It has a queue engine called TxEQ which is implemented on top of database tables, with some C code for efficient blocking polls. It scales horizontally by just adding database nodes while retaining ACID transactions, and you can get hosted versions in all the major clouds. I'm using it in a project at the moment and it's been working well. In particular, the ability to dequeue a message into the same transaction that does other database writes is very useful, as is the exposed lock manager.

Beyond scaling horizontally, the nice thing about TxEQ/AQ is that it's a full message queue broker with all the normal features you'd expect: delayed messages, exception queues, queue browsing, multi-consumer, etc. LISTEN/NOTIFY is barely a queue at all, really.
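For contrast, this is roughly all LISTEN/NOTIFY gives you - an ephemeral wakeup signal. A sketch with psycopg2 (channel name is made up); anything notified while no listener is connected is simply dropped, with no persistence, no ack, no redelivery and no exception queue:

    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    cur = conn.cursor()
    cur.execute("LISTEN jobs_channel;")  # hypothetical channel

    while True:
        # Wait until the connection's socket becomes readable, then drain
        # whatever notifications arrived. If this process is down when a
        # NOTIFY fires, that notification is gone.
        if select.select([conn], [], [], 5.0) == ([], [], []):
            continue  # timeout, poll again
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            print(notify.channel, notify.payload)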

For startups like this, the amount of time, money and morale they are losing on all these constant firefights just doesn't make sense to me. It doesn't have to be Oracle; there are other DBs that can do this too. But "we discovered X about Postgres" is an eng blog cliché by this point. You're paying $$$ to a cloud and GPU vendor anyway - just buy a database and get back to work!

[1] https://news.ycombinator.com/item?id=44074506

replies(5): >>44531153 #>>44531375 #>>44532059 #>>44532779 #>>44535131 #
ownagefool ◴[] No.44531375[source]
It actually depends on the workload.

Sending webhooks, as an example, often has zero need to go back and update the database, but I've seen that exact example take down several different managed databases (i.e., not just Postgres).

replies(1): >>44533280 #
mike_hearn ◴[] No.44533280[source]
Yes, that's true, but in a good implementation you will want to surface it to the recipient via some dashboard when delivery consistently fails. So at some point a message on the exception queue will need to update the db.
replies(1): >>44551338 #
ownagefool ◴[] No.44551338[source]
Whilst true, it probably doesn't need ACID / strong consistency.

Don't get me wrong. Correctness is a great default, much easier to reason about.

replies(1): >>44569210 #
mike_hearn ◴[] No.44569210[source]
You're right, I think we're in agreement.