https://github.com/cpursley/walex?tab=readme-ov-file#walex (there's a few useful links in here)
* It gives an indication of how much you need to grow before this Postgres functionality starts being a blocker.
* Folks encountering this issue—and its confusing log line—in the future will be able to find this post and quickly understand the issue.
Features that seem harmless at small scale can break everything at large scale.
The post author is too focused on using NOTIFY in only one way.
This post fails to explain WHY they are sending a NOTIFY. Not much use telling us what doesn’t work without telling us the actual business goal.
It’s crazy to send a NOTIFY for every transaction; they should be debounced/grouped.
The point of a NOTIFY is to let some other system know something has changed. Don’t do it every transaction.
It was particularly ironic because Elixir has a fantastic distribution and pubsub story thanks to distributed Erlang. That’s much more commonly used in apps now compared to 5 or so years ago, when 40-50% of apps weren’t clustered, thanks to the rise of platforms like Fly that made it easier, and the decline of Heroku that made it nearly impossible.
Wasn't aware of this AccessExclusiveLock behaviour - a reminder (and shameless plug 2) of how Postgres locks interact: https://leontrolski.github.io/pglockpy.html
Becomes a problem if you are inserting 40 items into the order_items table.
What I already know
- Unique indexes slow inserts since db has to acquire a full table lock
- Case statements in Where break query planner/optimizer and require full table scans
- Read only postgres functions should be marked as `STABLE PARALLEL SAFE`
Like if it needs to be very consistent I would use an unlogged table (since we're worried about "scale" here) and then `FOR UPDATE SKIP LOCKED` like others have mentioned. Otherwise what exactly is notify doing that can't be done after the first transaction?
Edit: in fact, how can they send an HTTP call for something and not be able to do a `NOTIFY` after as well?
One possible way I could understand what they wrote is that somewhere in their code, within the same transaction, there are notifies which conditionally trigger and it would be difficult to know which ones to notify again in another transaction after the fact. But they must know enough to make the HTTP call, so why not NOTIFY?
Can’t find the function that does that, and I’ve not seen it used in the wild yet, idk if there’s gotchas
Edit: found it, it’s pg_logical_emit_message
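For reference, it takes (transactional, prefix, content) and returns the WAL position of the message; a minimal sketch:

```sql
-- Emit a message into the WAL without touching any table. transactional=true
-- means it is only decoded if the surrounding transaction commits.
SELECT pg_logical_emit_message(true, 'my_app', '{"event": "order_created"}');
```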
Source: Dev at one of the companies that hit this issue with Oban
Holding transactions open is an anti-pattern for sure, but it's occasionally useful. E.g. pg_repack keeps a transaction open while it runs, and I believe vacuum also holds an open transaction part of the time too. It's also nice if your database doesn't melt whenever this happens on accident.
They’re using it wrong and blaming Postgres.
Instead they should use Postgres properly and architect their system to match how Postgres works.
There’s correct ways to notify external systems of events via NOTIFY, they should use them.
It'd be nice to have a method that would block for N seconds waiting for a new entry.
You can also use a streaming replication connection, but it often is not enabled by default.
I feel like somebody needs to write a book on system architecture for Gen Z that's just filled with memes. A funny cat pic telling people not to use the wrong tool will probably make more of an impact than an old fogey in a comment section wagging his finger.
> When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (ref) during the commit phase of the transaction, effectively serializing all commits.
Am I missing something? This seems like something the original authors of the system should have done due diligence on before implementing a write-heavy workload.
Might be a bit tricky to get debezium to decode the logical event, not sure
You'd have to at least accompany your memes with empirics. What is write-heavy? A number you might hit if your startup succeeds with thousands of concurrent users on your v1 naive implementation?
Else you just get another repeat of everyone cargo-culting Mongo because they heard that Postgres wasn't web scale for their app with 0 users.
…of course, you need dedup/support for duplicate messages on the notify stream if you do this, but that’s table stakes in a lot of messaging scenarios anyway.
This does sacrifice ordering and increases the risk of duplicates in the message stream, though.
If each tenant gets an instance I would call that a “shard” but in that pattern there’s no need for cross-shard references.
Maybe in the analytics stack but that can be async and eventually consistent.
In my experience, this means you make sure the polling solution is complete and correct, and the notifier gets reduced to a wake-up signal. This signal doesn't even need to carry the actionable change content, if the poller can already pose efficient queries for whatever "new stuff" it needs.
This approach also allows the poller to keep its own persistent cursor state if there is some stateful sequence to how it consumes the DB content. It automatically resynchronizes and the notification channel does not need to be kept in lock-step with the consumption.
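A minimal sketch of that shape (hypothetical table/channel names), where the trigger sends a contentless wake-up and the poller owns all the actual querying:

```sql
-- The notification carries no actionable content; it only wakes the poller,
-- which re-derives "new stuff" from its own persisted cursor.
CREATE FUNCTION wake_poller() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('new_work', '');  -- empty payload on purpose
  RETURN NULL;                        -- return value ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_wake
AFTER INSERT ON events
FOR EACH STATEMENT EXECUTE FUNCTION wake_poller();
```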
I wonder if that is fixable, or just inherent to its design.
Opaque to who? If there's a piece of business logic that says "After this table's record is updated, you MUST update this other table", what advantages are there to putting that logic in the application?
When (not if) some other application updates that record you are going to have a broken database.
Some things are business constraints, and as such they should be moved into the database if at all possible. The application should never enforce constraints such as "either this column or that column is NULL, but at least one must be NULL and both must never be NULL at the same time".
Your database enforces constraints; what advantages are there to code the enforcement into every application that touches the database over simply coding the constraints into the database?
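To make that concrete, the "exactly one of the two is NULL" rule above is a one-line CHECK constraint (hypothetical table/column names):

```sql
-- Exactly one of card_id / bank_account_id must be NULL:
ALTER TABLE payments
  ADD CONSTRAINT exactly_one_of_card_or_bank
  CHECK ((card_id IS NULL) <> (bank_account_id IS NULL));
```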
Do you expect it to be faster to do the trigger logic in the application? Wouldn't be slower to execute two statements from the application (even if they are in a transaction) than to rely on triggers?
It’s unsurprising to me that an AI company appears to have chosen exactly the wrong tool for the job.
SQS may have been a good "boring" choice for this?
What were the TPS numbers? What was the workload like? How big is the difference in %?
Use LISTEN/NOTIFY. You will get a lot of utility out of it before you’re anywhere close to these problems.
However, in 2025 I'd pick Redis or MQTT for this kind of role. I'm typically in multi-lang environments. Is there something better?
What did you replace them with instead?
I also found LISTEN/NOTIFY to not work well at this scale and used a polling based approach with a back off when no work was found.
Quite an interesting problem and a bit challenging to get right at scale.
It both polls (configurable per queue) and supports listen/notify simply to inform workers that they can wake up early to trigger polling, and this can be turned off globally with a notifications=false flag.
(Shameless plug [1]) I'm working on DBOS, where we implemented durable workflows and queues on top of Postgres. For queues, we use FOR UPDATE SKIP LOCKED for task dispatch, combined with exponential backoff and jitter to reduce contention under high load when many workers are polling the same table.
Would love to hear feedback from you and others building similar systems.
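For anyone unfamiliar with the pattern, here's a generic sketch of SKIP LOCKED dispatch (not DBOS's actual code; hypothetical tasks table):

```sql
-- Claim up to 10 pending tasks, skipping rows other workers hold locks on:
WITH claimed AS (
  SELECT id FROM tasks
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 10
  FOR UPDATE SKIP LOCKED
)
UPDATE tasks t
SET status = 'running', claimed_at = now()
FROM claimed
WHERE t.id = claimed.id
RETURNING t.id, t.payload;
```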
pg_logical_slot_get_binary_changes returns the same entries as the replication connection. It just has no support for long-polling.
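E.g., with the text variant (a sketch using the contrib test_decoding plugin):

```sql
-- Create a logical slot, then poll it; each call returns whatever has been
-- decoded since the last call. It never blocks waiting for more.
SELECT pg_create_logical_replication_slot('my_slot', 'test_decoding');
SELECT lsn, xid, data FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);
```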
That's where we use it at my work. We have host/networking deployment pipelines that used to have up to one minute of latency on each step because each was run on a one-minute cron. A short Python script/service that handled the LISTENing + adding NOTIFYs when the next step was ready removed the latency, and we'll never do enough for the load on the db to matter.
Databases can do a lot of stuff, and if you're not hurting for DB performance it can be a good idea to just... do it in the database. The advantage is that, if the DB does it, you're much less likely to break things. Putting data constraints in application code can be done, but then you're just waiting for the day those constraints are broken.
Not to mention that pubsub allows multiple consumers for a single message, whereas FOR UPDATE is single consumer by design.
I've landed on Postgres/ClickHouse/NATS, since together they handle nearly any conceivable workload, covering relational, columnar, and messaging/streaming very well. It's also not painful at all to use, as it's lightweight and fast/easy to spin up in a simple docker compose. Postgres is of course the core, and you don't always need all three, but they complement each other very well imo. This has been my "go to" for a while.
The people who design it walk away after a few years, so they don't give a crap what happens. The rest of us have to struggle to support or try to replace whatever the lumbering monstrosity is.
There is work happening currently to make Kafka behave more like a queue: https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A...
I think it’s a reasonable assumption. Based on the second half of your comment, you clearly don’t think highly of “AI companies,” but I think that’s a separate issue.
That is tricky due to transactions and visibility. How do you write the poller to not miss events that were written by a long/blocked transaction? You'd have to set the poller scan to a long time window (e.g. "process events that were written since now minus 5 minutes") and then make sure transactions are cancelled hard before those 5 minutes.
You should split your system into specialized components:

- Kafka for event transport (you're likely already doing this)
- An LSM-tree DB for write-heavy structured data (e.g. Cassandra)
- Keep Postgres for queries that benefit from relational features in certain parts of your architecture
In my linked example, on getting the item from the queue, you immediately set the status to something that you're not polling for - does Postgres still have to skip past these tuples (even in an index) until they're vacuumed up?
In databases where your domain is also your physical data model, coupling business logic to the database can work quite well, if the DBMS supports that.
https://medium.com/@paul_42036/entity-workflows-for-event-dr...
An INSERT never results in a full table lock (as in "the lock would prevent other inserts or selects on the table").
Any expression used in the WHERE clause that isn't indexed will probably result in a Seq Scan. CASE expressions are no different than e.g. a function call regarding this.
A stable function marked as "STABLE" (or even immutable) can be optimized differently (e.g. can be "inlined"), so yes that's a good recommendation.
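E.g., a sketch of the recommendation (hypothetical function):

```sql
-- STABLE: result doesn't change within a statement, so the planner may
-- inline it; PARALLEL SAFE: it may run inside parallel workers.
CREATE FUNCTION active_user_count() RETURNS bigint AS $$
  SELECT count(*) FROM users WHERE active
$$ LANGUAGE sql STABLE PARALLEL SAFE;
```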
Recordings can and should be streamed to an object store. Parallel processes can do transcription on those objects; bonus: when they inevitably have a bug in transcription, retranscribing meetings is easy.
The output of transcription can be a single file also stored in the object store with a single completion message notification, or if they really insist on “near real-time”, a message on a queue for every N seconds. Much easier to scale your queue than your DB, eg Kafka partitions.
A handful of consumers can read those messages and insert into the DB. Benefit is you have a fixed and controllable write load into the database, and your client workload never overloads the DB because you’re buffering that with the much more distributed object store (which is way simpler than running another database engine).
I like that. Sounds like a synonym for "Platform Engineering". :-)
I remember being amazed that lambda architecture was considered a kind of reference, when it looked to me more like a workaround.
We like to build IT cathedrals, until we have to run them.
- The batch size needs to be adaptive for performance, latency, and recovering smoothly after downtime.
- The polling timeouts, frequency etc the same.
- You need to avoid hysteresis.
- You want to be super careful about not disturbing the main application by placing heavy load on the database or accidentally locking tables/rows
- You likely want multiple distributed workers in case of a network partition to keep handling events
It’s hard to get right especially when the databases at the time did not support SKIP LOCKED.
In retrospect I wish I had listened to the WAL. Much easier.
[1] https://speakerdeck.com/gunnarmorling/ins-and-outs-of-the-ou...
[2] https://www.infoq.com/articles/wonders-of-postgres-logical-d...
[3] https://www.morling.dev/blog/mastering-postgres-replication-...
I guess some will argue that their business logic is special and really is so tightly coupled to the data definition that it belongs in the database, and I’m not going to claim those use cases don’t exist, but I’ve seen over-coupling far more often than under-coupling.
This is why I say: Applications come and go, but data is forever.
None of this means you have to or even should use stored procedures, triggers, or listen/notify. I'm just making the point that there is no clean separation between "data" and "business logic".
The documentation doesn’t mention any caveats in this direction, and they had 3 periods of downtime in 4 days, so I don’t think it’s a given that testing would have hit this problem.
Maybe I missed it in some folded up embedded content, or some graph (or maybe I'm probably just blind...), but is it mentioned at which point they started running into issues? The quoted bit about "10s of thousands of simultaneous writers" is all I can find.
What is the qualitative and quantitative nature of relevant workloads? Depending on the answers, some people may not care.
I asked ChatGPT to research it and this is the executive summary:
For PostgreSQL’s LISTEN/NOTIFY, a realistic safe throughput is:

- Up to ~100–500 notifications/sec: Handles well on most systems with minimal tuning. Low risk of contention.
- ~500–2,000 notifications/sec: Reasonable with good tuning (short transactions, fast listeners, few concurrent writers). May start to see lock contention.
- ~2,000–5,000 notifications/sec: Pushing the upper bounds. Requires careful batching, dedicated listeners, possibly separate Postgres instances for pub/sub.
- >5,000 notifications/sec: Not recommended for sustained load. You’ll likely hit serialization bottlenecks due to the global commit lock held during NOTIFY.
My other reference for a slightly different problem is https://www.thatguyfromdelhi.com/2020/12/what-postgres-sql-c...
However, I've been in several situations where scaling the queue brings down the database, and therefore the app, and am thus of the opinion you probably shouldn't couple these systems too tightly.
There are pros and cons, of course.
Of course, this is sometimes abused and taken to extremes in a microservices architecture where each service has their own database and you end up with nastiness like data duplication and distributed locking.
This seems like another case where Postgres gets free marketing due to companies hitting its technical limits. I get why they choose to make lemonade in these cases with an eng blog post, but this is a way too common pattern on HN. Some startup builds on Postgres then spends half their eng budget at the most critical growth time firefighting around its limits instead of scaling their business. OpenAI had a similar blog post a couple of months ago where they revealed they were probably spending more than quarter of a million a month on an Azure managed Postgres, and it had stopped scaling so they were having to slowly abandon it, where I made the same comment [1].
Postgres is a great DB for what you pay, but IMHO well capitalized blitzscaling startups shouldn't be using it. If you buy a database - and realistically most Postgres users do anyway as they're paying for a cloud managed db - then you might as well just buy a commercial DB with an integrated queue engine. I have a financial COI because I have a part time job there in the research division (on non-DB stuff), so keep that in mind, but they should just migrate to an Oracle Database. It has a queue engine called TxEQ which is implemented on top of database tables with some C code for efficient blocking polls. It scales horizontally by just adding database nodes whilst retaining ACID transactions, and you can get hosted versions of them in all the major clouds. I'm using it in a project at the moment and it's been working well. In particular the ability to dequeue a message into the same transaction that does other database writes is very useful, as is the exposed lock manager.
Beyond scaling horizontally the nice thing about TxEQ/AQ is that it's a full message queue broker with all the normal features you'd expect. Delayed messages, exception queues, queue browsing, multi-consumer etc. LISTEN/NOTIFY is barely a queue at all, really.
For startups like this, the amount of time, money and morale they are losing with all these constant stories of firefights just doesn't make sense to me. It doesn't have to be Oracle, there are other DBs that can do this too. But "We discovered X about Postgres" is a eng blog cliché by this point. You're paying $$$ to a cloud and GPU vendor anyway, just buy a database and get back to work!
Maybe you also don't know what ChatGPT Research is (the Enterprise version, if you really need to know), or what Executive Summary implies, but here's a snippet of the 28 sources used:
Of course, you have the people that correctly use em-dashes, too.
For startups, Postgres is a fantastic first choice. But plan ahead: as your workload grows, you’ll likely need to migrate or augment your stack.
1) the Postgres documentation does not mention that Notify causes a global lock or lock of any sort (I checked). That’s crazy to me; if something causes a lock, the documentation should tell you it does and what kind. Performance notes also belong in documentation for dbs.
2) why the hell does notify require a lock in the first place? Reading the comment this design seems insane; there’s no good reason to queue up notifications for transactions that aren’t committed. Just add the notifications in commit order with no lock, you’re building a db with concurrency, get used to it.
I found this out the hard way when I had a simple query that suddenly got very, very slow on a table where the application would constantly do a `SELECT ... FOR UPDATE SKIP LOCKED` and then immediately delete the rows after a tiny bit of processing.
It turned out that with a nearly empty table of about 10-20k dead tuples, the planner switched to using a different index scan, and would overfetch tons of pages just to discard them, as they only contained dead tuples. What I didn't realize is that the planner statistics don't care about dead tuples, and ANALYZE doesn't take them into account. So the planner started to think the table was much bigger than it actually was.
It's really important for these use cases to tweak the autovacuum settings (which can be set on a per-table basis) to be much more aggressive, so that under high load, the vacuum runs pretty much continuously.
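Something along these lines (hypothetical table; the exact numbers need testing against your workload):

```sql
ALTER TABLE jobs SET (
  autovacuum_vacuum_scale_factor = 0.0,  -- don't wait for a % of the table
  autovacuum_vacuum_threshold    = 1000, -- vacuum after ~1000 dead tuples
  autovacuum_vacuum_cost_delay   = 0     -- don't throttle the vacuum worker
);
```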
Another option is to avoid deleting rows, but instead use a column to mark rows as complete, which together with a partial index can avoid dead tuples. There are both pros and cons; it requires doing the cleanup (and VACUUM) as a separate job.
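A sketch of that variant (hypothetical schema): pending rows stay in a small partial index, and completed rows simply fall out of it:

```sql
ALTER TABLE jobs ADD COLUMN done boolean NOT NULL DEFAULT false;
CREATE INDEX jobs_pending_idx ON jobs (created_at) WHERE NOT done;
-- consumers mark completion instead of deleting:
--   UPDATE jobs SET done = true WHERE id = ...;
-- a separate periodic job deletes old done rows (and vacuums).
```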
To make those claims it's necessary to know what work is being done while the lock is held. This includes a bunch of various resource cleanup, which should be cheap, and RecordTransactionCommit() which will grab a lock to insert a WAL record, wait for it to get flushed to disk and potentially also for it to get acknowledged by a synchronous replica. So the expected throughput is somewhere between hundreds and tens of thousands of notifies per second. But as far as I can tell this conclusion is only available from PostgreSQL source code and some assumptions about typical storage and network performance.
Turns out that all Postgres versions from 9.6 through current master scale linearly with the number of idle listeners — about 13 μs extra latency per connection. That adds up fast: with 1,000 idle listeners, a NOTIFY round-trip goes from ~0.4 ms to ~14 ms.
To better understand the bottlenecks, I wrote both a benchmark tool and a proof-of-concept patch that replaces the O(N) backend scan with a shared hash table for the single-listener case — and it brings latency down to near-O(1), even with thousands of listeners.
Full benchmark, source, and analysis here: https://github.com/joelonsql/pg-bench-listen-notify
No proposals yet on what to do upstream, just trying to gather interest and surface the performance cliff. Feedback welcome.
Maybe throw in a dedicated key-value store like Redis or Valkey.
Oh and maybe something S3 compatible like MinIO, Garage or SeaweedFS for storing bunches of binary data.
With all of that, honestly it should cover most of the common workloads out there! Of course, depends on how specialized vs generic you like your software to be.
I've seen people who disagree with that statement and say that having a separate back end component often leads to overfetching and in-database processing is better. I've worked on some systems where the back end is essentially just passing data to and from stored procedures.
It was blazing fast, but working with it absolutely sucked - though for whatever reason the people who believe that seem to hold those views quite strongly.
Back to the topic: Lots of potential bugs and data corruption issues are solved by moving part of the business logic to the database. Other people already covered two things: data validation and queue atomicity.
On the other hand, lots of potential issues can also arise by putting other parts of business logic to the database, for example, calling HTTPS endpoints from inside the DB itself is highly problematic.
The reality is that the world is not black and white, and being an engineer is about navigating this grey area.
There are things that work better, are safer and simpler to do on the database, and things that work better, are safer and simpler in code. And those things might change depending on context, technology, requirements, size of project, experience of contributors, etc.
Forcing round pegs into square holes will always lead to brittle code and brittle products, often for more cost (mental and financial!) than actually using each tool correctly.
https://github.com/daitangio/pque
I evaluated listen/notify but it seems to lose messages if no one is listening, so its use case seems pretty limited to me (my 2 cents).
Anyway, if you need to scale, I suggest a dedicated queue server like RabbitMQ.
In the beginning having fewer parts to connect and maintain lets the needs and bottlenecks of the actual application emerge.
If it was listen/notify in such a scenario at some volume where optimizing it isn’t in the cards… so be it. It would be some time down the road before sharding a function into a specific subsystem like what you described.
Appreciate learning about the Postgres/ClickHouse/NATS combo. If there's an article on the three together that you liked, I'd be happy to read and learn.
The only reasons I can think of:
- you're rewriting a legacy system and migrate parts incrementally
- data compliance
- you're running a dangerous database setup
I try my best to avoid putting any business logic inside databases and see stored procedures only as a temporary solution.
Sending webhooks, as an example, often has zero need to go back and update the database, but I've seen that exact example take down several different managed databases ( i.e., not just postgres ).
Asynchronous protocols frequently result in callback-based or generator-style APIs on the client side, which are hard to implement safely and intuitively. For example, consider building a real-time SDK for something like NATS. Once a message arrives, you need to invoke a user-defined callback to handle it. At that point, you're faced with a design decision: either call the callback synchronously (which risks blocking the socket reading loop), or do it asynchronously (which raises issues like backpressure handling).
Also, SDKs are often developed by different people, each with their own design philosophy and coding style, leading to inconsistency and subtle bugs.
So this isn't only about NATS. Just last week, we ran into two critical bugs in two separate Kafka SDKs at work.
Both have their place in business.
Yes, that's what a snippet generally is. The generated document from my very basic research prompt is over 300k in length. There are also sources from the official mailing lists, graphile, and various community discussions.
I'm not going to post the entire output because it is completely beside the point. In my original post, I explicitly asked "What is the qualitative and quantitative nature of relevant workloads?" exactly because it's not clear from the blog post. If, for example, they only started hitting these issues with 10k simultaneous reads/writes, then it's reasonable to assume that many people who don't have such high workloads won't really care.
The ChatGPT snippet was included to show that that's what ChatGPT research told me. Nothing more. I basically typed a 2-line prompt and asked it to include the original article. Anyone who thinks that what I posted is authoritative in any way shouldn't be considering doing this type of work.
It really is, and it’s often surprising to me how basic some of the issues being discovered are. Like Figma, who waited a shocking amount of time to add [0] PgBouncer and read replicas. This is such a well-trod path that it’s baffling to me why you wouldn’t add it once it’s clear you have a winning product. At the very least, PgBouncer (or PgCat, or any other connection pooler / proxying service) - it adds negligible cost per month (in comparison to DB read replicas) to run a couple of containers with a load balancer.
Re: Oracle, as much as I despise the company for its litigious practices, I’ll hand it to you that the features your DB has are astonishing. RAC is absolutely incredible (on paper - I’ve never used it).
[0]: https://www.figma.com/blog/how-figma-scaled-to-multiple-data...
The DB can make much stronger guarantees about transactions and updates the closer that logic happens to itself. In the world of cloud computing, this can be a cost savings for ingress/egress too.
listen/notify isn't necessarily a replacement for Redis or other pub/sub systems, and Redis pub/sub and similar isn't necessarily a replacement for, idk, Kafka or similar queue/messaging systems
but a lot of companies have (by modern standards) surprisingly small amounts of data, where even an increase of 2, 3, 4x still isn't that big. In that case listen/notify and similar might just work fine :shrug:
also, the same is true the other way around: depending on your application you can go Redis-only, as long as your data volume stays small enough and your transactional/sync needs are reasonably simple (with WATCH+EXEC, NX/XX options etc., and maybe some Redis-side Lua scripts, you can do quite a lot for data synchronization). The issue with that is that stylistically Redis data-sync/transaction code is often much more similar to writing atomic data structures than to SQL transactions, and even for SQL transactions there is a trend of devs severely overestimating what they provide, so often you are better off not touching it when you can avoid it. Also, BTW, Redis has something very similar to SQLite or NOTIFY where "basically" (oversimplified by a lot) there is only one set of writes done at a time ;) (and then afterwards distributed to replicas), just that outside of some micro Lua scripts you don't really run much logic beyond some NX/XX checks etc., so it's not blocking much, and it's "more or less" all just in memory, not touching a WAL (again, oversimplified).
The DB is - or should be - the source of truth for your application. Also, since practically everyone is using cloud RDBMS with (usually) networked storage, the latency is atrocious. Given those, it seems silly to rely on an application to react to and direct changes to related data.
For example, if you want to soft-delete customer data while maintaining the ability to hard-delete, then instead of having an is_deleted and/or deleted_at column, have a duplicate table or tables, and an AFTER DELETE trigger on the originals that move the tuples to the other tables.
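A sketch of that trigger (hypothetical tables):

```sql
CREATE TABLE customers_deleted (LIKE customers INCLUDING ALL);

CREATE FUNCTION archive_customer() RETURNS trigger AS $$
BEGIN
  INSERT INTO customers_deleted SELECT OLD.*;  -- move the tuple to the archive
  RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_archive
AFTER DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION archive_customer();
```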
Or if you want to have get_or_create without multiple round trips (and you don’t have Postgres’ MERGE … RETURNING), you can easily accomplish this with a stored procedure.
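In Postgres specifically, the same single round trip can also be had with INSERT ... ON CONFLICT wrapped in a function (a sketch, assuming a unique constraint on name):

```sql
CREATE FUNCTION get_or_create_tag(p_name text) RETURNS bigint AS $$
  INSERT INTO tags (name) VALUES (p_name)
  ON CONFLICT (name) DO UPDATE SET name = EXCLUDED.name  -- no-op update so that
  RETURNING id                                           -- RETURNING sees the row
$$ LANGUAGE sql;
```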
Using database features shouldn’t be seen as verboten or outdated. What should be discouraged is not treating things like stored procedures and triggers as code. They absolutely should be in VCS, should go the same review process as anything else, and should be well-documented.
RDBMS are extremely well-tested pieces of software that do their job incredibly well. To think that you could do better, or even equally as well, is hubris. If you want to trade those guarantees for “velocity” go right ahead, but you also need to take into account the inevitable incidents and recoveries that will occur.
More broadly, not knowing how to write SQL is a very solvable problem, and frankly anyone accessing an RDBMS as a regular part of their job should know it. Even if you’re always using an ORM, you should understand what it’s doing so you can understand the EXPLAIN output you’ll probably be looking at eventually.
With Entity Framework Code First, Microsoft made it possible for generations of developers to barely touch a database.
A lot of Devs have poor database skills nowadays.
Which suits the cloud sellers who want to push managed platforms
For sharding, Vitess kind of supports them; Citus fully supports them.
You’re correct that they do impact performance to an extent, but as a counter argument, if your data is incorrect, it doesn’t matter how quickly you wrote it.
Not to mention the difficulty in maintaining referential integrity with all of that duplicated data. There are various workarounds, but at that point it feels very much like we’re recreating a shared DB, but shittily, and netting zero benefits.
Sure, you need transactions for processing things in a queue (mark as "taken out" but don't remove yet, then remove, or "place back in" (or into a failed-messages inbox) on timeout or similar can be _very_ important for queue systems).
But the moment the "fail safe if something dies while processing a message" behavior becomes directly coupled to DB transactions, you have created something very brittle and cumbersome.
To be fair, that might still be the best solution for some situations.
But the better solution is to treat a queue as a message-passing system and handle messages as messages, with the appropriate delivery semantics. And if you can't, because e.g. your idempotency logic is super unreliable, then there indeed is a problem, but it's not in the missing cross-system transactions; it's in how you write that logic (missing _tooling_, strict code guidelines people actually comply with, interface regression checks, tests (including prop/fuzz tests, regression tests, integration/e2e tests etc., not just "dumb" unit tests)).
> just migrate to an Oracle Database.
In my experience Oracle DB is very powerful but also very cumbersome in a lot of ways, and if you need things only they can provide, you most likely already fucked up big time somewhere else in your design/architecture. Sure, if you are at that point, Oracle can easily be the cheaper solution. But preferably you never end up there.
As a side note, there are also a lot of decent plugins which can provide similar capabilities to PG, but they tend to have the issue that they aren't part of managed PG solutions, and self-managing PG (or most other reasonably powerful DBs) can be a huge pain, and then yes, Oracle can be a solution.
Still, the number of startups that have had an overall good experience with Oracle is, in my experience, basically zero. (But there are some pretty big companies/projects I know of which have had an overall good experience with Oracle.)
> constant stories of firefights
If you mean stories on HN, then that isn't a meaningful metric; you only hear about the "interesting" stories, which are mostly about firefighting or "using pg for everything is great", but rarely the majority of in-between stories and boring, silent successes. If it's about stories from your career and from asking dev friends about their experience, then it is more meaningful. But that, too, is a bubble (like this answer of mine is, without question, in a bubble).
Generally I think people really overestimate how representative HN is. Idk about the US, but outside of it _huge_ parts of the IT industry are not represented on HN in any meaningful way. I would say in my country HN is _at most_ representative of 1/4 of the industry, but that 1/4 also contains many of the very, very motivated software developers, and very few of the "that's my work, but not my calling", bread-and-butter software devs, who are often 1/3 to over 1/2 of devs in most countries as far as I can tell.
But, oversimplified, it's a case of "high queue load f*s up the availability/timings of other DB operations" (and of the queue itself).
And that's a generic problem you have, even if just due to generic CPU/WAL/disk load, if you put your queue into your DB, even if that specific lock were somehow solved with some atomic concurrent algorithm or similar (not sure that's even possible).
So in general, make your storage DB and your queue different services (and your cache too), even if they use the same kind of storage. (Though technically there are clever in-between solutions which run their own queue service but still use your DB for final storage, with a ton of caching, in-memory locking etc. to remove a huge part of the load from the DB.)
You only need to cover three scenarios and it's very simple to implement: record added +1, record removed -1, record moved +1 & -1.
If you have counts that are more complicated, it doesn't work but this solution easily beats semi-frequent COUNT queries.
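A sketch of those three cases as one trigger (hypothetical items/folder_counts schema, one folder_counts row per folder):

```sql
CREATE FUNCTION bump_folder_count() RETURNS trigger AS $$
BEGIN
  IF TG_OP IN ('DELETE', 'UPDATE') THEN   -- removed, or moved out
    UPDATE folder_counts SET n = n - 1 WHERE folder_id = OLD.folder_id;
  END IF;
  IF TG_OP IN ('INSERT', 'UPDATE') THEN   -- added, or moved in
    UPDATE folder_counts SET n = n + 1 WHERE folder_id = NEW.folder_id;
  END IF;
  RETURN NULL;  -- AFTER trigger: return value ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_count
AFTER INSERT OR DELETE OR UPDATE OF folder_id ON items
FOR EACH ROW EXECUTE FUNCTION bump_folder_count();
```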
Really, FKs are typically implemented internally by RDBMSes as TRIGGERs that do what you expect FKs to do, so in a sense they are nothing more than syntactic sugar.
IMO LISTEN/NOTIFY is badly designed as an interface to begin with because there is no way to enforce access controls (who can notify; who can listen) nor is there any way to enforce payload content type (e.g., JSON). It's very unlike SQL to not have a `CREATE CHANNEL` and `GRANT` commands for dealing with authorization to listen/notify.
If you have authz then the lack of payload content type constraints becomes more tolerable, but if you add a `CREATE CHANNEL` you might as well add something there regarding payload types, or you might as well just make it so it has to always be JSON.
With a `CREATE CHANNEL` PG could provide:

- authz for listen
- authz for notify
- payload content type constraints (maybe always JSON if you CREATE the channel)
- select different serialization semantics (to avoid this horrible, no good, very bad locking behavior)
- backwards-compatibility for listen/notify on non-created channels
The standard workflow for processing something from a queue is to keep track of all the messages you have already processed in the transactional database and simply request the remaining unprocessed messages. Often this is as simple as storing the last successfully processed message ID in the database and updating it in the same transaction that has processed the message. If an error occurs you roll the transaction back, which also rolls back the last message ID. The consumer will automatically re-request the failed message on the next attempt, giving you out of the box idempotency for at least once messaging.
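A sketch of that loop (hypothetical tables; the :params are filled in by the app):

```sql
BEGIN;
-- lock our cursor row so only one instance of this consumer runs at a time
SELECT last_id FROM consumer_cursor WHERE consumer = 'billing' FOR UPDATE;
-- fetch the next unprocessed batch
SELECT * FROM messages WHERE id > :last_id ORDER BY id LIMIT 100;
-- ... process the batch, then advance the cursor ...
UPDATE consumer_cursor SET last_id = :max_processed_id WHERE consumer = 'billing';
COMMIT;  -- on error, ROLLBACK also rewinds the cursor, so the batch is retried
```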
"LISTEN/NOTIFY got us to this level of concurrency; here's how we diagnosed the performance cliff, and here's what we're doing now."
Which is like... cool, you were able to scale pretty far and create a lot of value before you needed to find a new solution.

Or you could have a worker whose only job is to listen to the wal / logical replication stream and then NOTIFY. Being the only one to do so would not burden other transactions.
Or you could have a worker whose only job is to listen to the wal / logical replication stream and then publish on some non-PG pubsub system.
`pg_logical_emit_message()` perpetuates/continues the lack of authz around `NOTIFY`.
I.e., the connection pool API has to be designed with this in mind.
For that matter connection pools also need to be designed with the ability to run code upon connecting to create TEMP schema elements because PG lacks GLOBAL TEMP.
> When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (ref) during the commit phase of the transaction, effectively serializing all commits.
It only serializes commits where NOTIFY was issued as part of the transaction, right? Transactions which did not call NOTIFY should not be affected?
Plus, for queues, it's so much easier to leverage database constraints and transactions to implement global concurrency limit, rate limit, and deduplication.
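E.g., deduplication can be as small as a unique key (hypothetical schema):

```sql
-- A duplicate enqueue of the same logical job becomes a no-op:
INSERT INTO jobs (dedup_key, payload)
VALUES ('reindex:order:1234', '{"order_id": 1234}')
ON CONFLICT (dedup_key) DO NOTHING;
```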
So pgflow is really agnostic, and Postgres is its fundamental dependency. All components are modular and ready to be adapted to other runtimes.
It's just that Supabase is what I use and I figured out this will be my first platform, but the abstraction to port to others is there!
Anyway, the article indicates that the fix was very simple and primarily in the application layer. Makes me wonder if someone was getting "creative" when they used LISTEN/NOTIFY.
I think the sentiment is to use it "for everything" in 99% of business cases, which involve a few hundred GB of data and some thousands of QPS, and can be handled by PG very well.
Doing the extra work in stored procedures is noticeably faster than relying on triggers.
No big tech companies or unicorn type startups are using Oracle. Is your claim that they are all wrong?
>> Some startup builds on Postgres then spends half their eng budget at the most critical growth time firefighting around its limits instead of scaling their business
This is why I suggest starting with some kind of normal queue / stream mechanism and columnar DB if needed. It isn't even harder than using one DB, particularly if you are using niche features.
It is arguable. Let's say your team knows Postgres well from a relational standpoint. Now they need to do something with messages and require some type of notification/messaging. There is a learning curve here anyway. I'd argue they should spend it on more standard approaches, which are not harder to start with. Of course, if you know that your site/service will only be used by yourself and your grandmother, do whatever you want (just use a text file, or better yet just call her instead).
In this case, you might have enough dead tuples across your heap that you might get a lot of HOT updates. If you are processing in insertion order, you will also probably process in heap order, and you can actually get 0 HOT updates since the other tuples in the page aren't fully dead yet. You could try using a lower fillfactor to avoid this, but that's also bad for performance so it might not help.
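For reference, the fillfactor knob mentioned is a per-table setting (a sketch; it only affects newly written pages):

```sql
-- Keep pages half-empty so updates can land on the same page (HOT),
-- at the cost of a larger table:
ALTER TABLE jobs SET (fillfactor = 50);
```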
That said, I tend to reach for redis/rabbit or kafka relatively early, depending on my specific needs and what's in use. My main historical use of a dbms queue is sending/tracking emails when the email service I was using was having hiccups.
(I thought this was a fun puzzle, so don't take this as advice or as disagreement with your point.)
There is the option to use functions with SECURITY DEFINER to hack around this, but the cleanest way to do it (in the current API) would be to encrypt your messages on the application side using an authenticated system (eg AES-GCM). You can then apply access control to the keys. (Compromised services could still snoop on when adjacent channels were in use, however.)
As of PG16, HOT updates are tolerated against summarizing indexes, such as BRIN.
https://www.postgresql.org/docs/16/storage-hot.html
Besides, you probably don't want "done" jobs in the same table as pending or retriable jobs - as you scale up, you likely want to archive them as it provides various operational advantages, at no cost.
Now if I could specify that I want Order.customer_name to come from a specific other model and be updated automatically the ORM could automatically create a trigger to update that field when the customer table is updated.
Obviously this is a very simplistic example but there are many more, including versioning and soft deletes, that could be incredibly useful. But the key is that the ORM has to generate the code for the triggers and stored procedures. Doing that manually is possible now, but (a) it uses a different language than even regular SQL, which not everyone is familiar with, and (b) there is no type checking for what you are doing. The ORM model definitions are the main source of truth about the shape of your database, so I want to use them as such.
If you read my earlier comment properly, you'll notice the "done" column is there to avoid deleting rows on the hot path and to avoid dead tuples messing up the planner. I agree that a table should not contain done jobs, but then you risk running into the dead tuple problem. Both approaches are a compromise.
I'm amused at how op brags about the huge scale at which they operate, but instead of even considering fixing the issue (both for themselves and for others), they just switched to something else for pubsub.
Really the primary reason not to try stuff like this is (at least for me), feel that I won't paint myself into a corner with Postgres. I can always add a table here or a join there and things will work. If I need columnar, I use ClickHouse and NATS for messaging. I know these well but still gravitate toward Postgres because I feel it can grow in whatever direction is needed. However, it is true, I have thought about trying to just use NATS KV and make all services stateful receiving notifications when things change. It does seem that it could massively simplify some things but expect there could be some sharp edges in the face of unknown requirements. If one could just design for exactly the problem at hand it would be different but it never seems to work out like that.
Maybe throw your colleagues out the window instead if they don't know what they are talking about. I'm not anti/pro SPROC at all, but I am anti anti-reality. People that don't understand the vast differences in latencies between in process and out of process work should not exist in the industry.
Generally customers don't care about religious views. Make understanding the actual machine and associated latencies your religion instead. The reason to write a stored proc or do some processing in the database is entirely about data locality, not to keep the drooling masses from messing things up. A library is fine for that.
Data locality is king. Everything comes down to physical things such as blocks on the SSD, network interconnect, RAM, L3, L2, L1 cache and registers. Are those customer fields in the same page as whatever else you need? If so, most of the work is already done. Yes, you can save some network bandwidth transferring things that aren't needed but does it matter? It might but it might not. The key is to know what matters and reason about things from the perspective of the machines actually doing the work.
W.R.T. unicorn type startups; yes, my argument is that they are all wrong and should be using a different database. There's competitive advantage to be had there.
Updates of the lifecycle properties can also help coordinate multiple pollers so that they never work on the same item, but they can have overlapping query terms so that each poller is capable of picking up a particular item in the absence of others getting there first.
You also need some kind of lease/timeout policy to recognize orphaned items. I.e. claimed in the DB but not making progress. Workers can and should have exception handling and compensating updates to report failures and put items "back in the queue", but worst case this update may be missing. Some process, or even some human operator, needs to eventually compensate on behalf of the AWOL worker.
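A sketch of that compensating sweep (hypothetical schema and timeout):

```sql
-- Put items claimed by AWOL workers back in the queue:
UPDATE tasks
SET status = 'pending', claimed_by = NULL
WHERE status = 'running'
  AND claimed_at < now() - interval '10 minutes';
```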
In my view, you always need this kind of table-scanning logic, even if using something like AMQP for work dispatch. You get in trouble when you fool yourself into imagining "exactly once" semantics actually exists. The message-passing layer could opportunistically scale out the workload, but a relational backstop can make sure that the real system of record is coherent and reflecting the business goals. Sometimes, you can just run this relational layer as the main work scheduler and skip the whole message-passing build-out.
https://www.postgresql.org/message-id/flat/CAM527d_s8coiXDA4...
https://www.postgresql.org/message-id/flat/175222328116.3157...
That means for moderate cases you do not even have to care about this. 99% of PostgreSQL instances out there are not big "scale".
As a sr. engineer, it is your responsibility to decide whether you will build for "scale" from day zero or ignore it, mindful that it will not affect you until a certain point.
RDBMS are an old fogey tool. It takes a really old fogey to suggest storing records at fixed byte intervals directly on the disk - is that your proposed alternative? Or perhaps you grew up in the microservices era and that's already become old fogey.
Multi-master transactional databases are an open area of research, as far as I'm aware, but read-only replication is a solved problem. Therefore your write traffic, including your transaction overhead, has to fit within one server's capacity, while your read traffic can scale horizontally as much as you like.
Don't get me wrong. Correctness is a great default, much easier to reason about.
When trying to make good use of RDBMS transactional semantics, I think an important mental shift is to think of there being multiple async processing domains rather than a single magical transaction space. DB transactions are just communication events, not actual business work. This is how the relational DB can become the message broker.
The agents need to do something akin to 2-phase commit protocols to record their "intent" and their "result" across different business resources. But, for a failure-prone, web style network of agents, I would not expose actual DB 2-phase commit protocols. Instead, the relational model reifies the 2-phase-like state ambiguity of particular business resources as tuples, and the agents communicate important phases of their work process with simpler state update transactions.
It's basically the same pattern as with safe use of AMQP, just replacing one queue primitive with another. Both approaches require delayed acknowledgement patterns, so tasks can be routed to an agent but not removed from the system until after the agent reports the work complete. Either approach has a lost or orphaned task hazard if naively written to dequeue tasks earlier in the work process. An advantage of the RDBMS-based message broker is that you can also use SQL to supervise all the lifecycle state, or even intervene to clean up after agent failures.
In this approach, don't scale up a central RDBMS by disabling all its useful features in a mad dash for speed. Instead, think of the network of async agents (human or machine) and the RDBMS message broker(s) sized for their respective traffic. This agent network and communication workload can often be partitioned to reach scaling goals. E.g. specific business resources might go into different "home" zones with distinct queues and agent pools. Their different lifecycle states do not need to exist under a single, common transaction control.