I wish the PostgreSQL community would stop chasing more frontend features and spend a concerted few years completely renovating their storage layer. The effort in each release seems massively and disproportionately skewed towards frontend improvements without the will to address these fundamental issues.
It's absurd that in 2024, "the world's most advanced open source database" doesn't have a method of doing upgrades between major versions that doesn't involve taking the database down.
Yes, logical replication exists, but it still doesn't do DDL, so it has big caveats attached.
For large instances this is a big ask, especially of a project without a single person in charge. MySQL does have better replication, yet it still often requires manually setting up replication and cutting it over to do major version upgrades.
In practice, the only way to change the fundamental architecture of a database is to write a new one, with everything that entails.
It seems like one of the biggest fundamental flaws is that Postgres chose the O2N approach for tracking row versions instead of N2O. While switching to N2O wouldn't solve all problems (e.g. the article also talks about how Postgres stores full row copies and not just diffs), from an "80/20 rule" perspective, it seems like it would get rid of most of the downsides of the current implementation. For example, I'd assume that the vast majority of the time transactions want the latest row version, so using the N2O ordering means you could probably do away with storing each row version in an index, as you'd only need to traverse the linked list if you needed an older version, which should be much less common.
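For what it's worth, a toy Python sketch of the two orderings (not Postgres's actual on-disk layout, just the idea): with O2N the common "give me the latest version" lookup pays for the chain length, while with N2O it's the rare "give me an old version" lookup that walks the chain.

    # Toy model of a version chain; `data` is a full copy of the tuple for that
    # version, `next` points to the adjacent version in the chain.
    class Version:
        def __init__(self, data, next_version=None):
            self.data = data
            self.next = next_version

    # O2N (roughly what Postgres does): the entry point is the oldest version,
    # so finding the latest visible version means walking the whole chain.
    def latest_o2n(head):
        v = head
        while v.next is not None:
            v = v.next
        return v.data

    # N2O: the entry point is the newest version; the common case is O(1) and
    # only readers needing an older snapshot walk the chain.
    def latest_n2o(head):
        return head.data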
The Part of PostgreSQL We Hate the Most (2023)
Doesn't git famously _not_ store diffs, and instead follow the same storage pattern Postgres uses here, storing the full new and old objects?
Every application is a bit different, but it's not that the PostgreSQL design is a loser in all regards. It's not like bubble sort.
the article literally says that pg's mvcc design is from the 90s and no one does it like that any more. that is technology that is outdated by over 30 years. i'd say it does not make it a loser in all regards, but in the most important aspects.
Periodically (I believe it used to be every thousand commits, though I'm not sure what the heuristic is today), git will take the loose objects and compress them into a pack.
The full object format is how git manipulates objects internally: to do anything useful with a deltified object, it first has to be reconstructed in full, with all deltas applied, before anything can be done with it.
It's also worth noting that accessing a deltified object is slow (O(n) in the number of deltas), so the length of the delta chain is limited. Because deltification is really just a compression format, it doesn't matter how or where the deltas are done -- the trivial "no deltas" option will work just fine if you want to implement that.
You can trivially verify this by creating commits and looking in '.git/objects/*' for loose objects, running 'git repack', and then looking in '.git/objects/pack' for the deltified packs.
2. git storage does compress, and the compression is "diff-based" of sorts, but it is not based on commit history as one might naively expect.
Though full tuples are pretty fundamental to the underlying implementation... MVCC, VACUUM, etc. It'd be a massive change, to say the least.
I’d prefer not to be the first person running up against a limit or discovering a bug in my DB software.
So, newer doesn't always mean better, just saying.
An interesting behavior of MySQL that I have observed (~500GB database, with a schema that is more document-oriented than relational) is that when you update a single row, doing SELECT id WHERE something; UPDATE what WHERE id=id is orders of magnitude faster than UPDATE what WHERE something. I somehow suspect that this is the reason for this behavior. But the normal workload will not do that, and this only slows down ad-hoc DML when you fix some inconsistency.
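For anyone who wants to reproduce the comparison, a rough sketch of the two access patterns (Python with a generic DB-API connection `conn`; the `docs`/`what`/`something` names are made up to match the pseudo-SQL above):

    # Assumes `conn` is a PEP 249 (DB-API) connection to the MySQL database and
    # that `docs(id, what, something)` is a stand-in for the real schema.

    def update_direct(conn, value, key):
        # One statement: the UPDATE has to locate the row via `something` itself.
        cur = conn.cursor()
        cur.execute("UPDATE docs SET what = %s WHERE something = %s", (value, key))
        conn.commit()

    def update_via_id(conn, value, key):
        # Two statements: a cheap SELECT to find the primary key, then an UPDATE
        # keyed only on id -- observed to be orders of magnitude faster here.
        cur = conn.cursor()
        cur.execute("SELECT id FROM docs WHERE something = %s", (key,))
        row = cur.fetchone()
        if row is not None:
            cur.execute("UPDATE docs SET what = %s WHERE id = %s", (value, row[0]))
        conn.commit()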
Oriole has joined us at supabase now and it’s being worked on full time by Alexander and his team. Here is the patch set:
https://www.orioledb.com/docs#patch-set
It will be available to try on the supabase platform later this year too
Flat files have also been reliably used in production for decades. That doesn't mean they're ideal...although amusingly enough s3 and its equivalent of flat files is what we've migrated to as a data store.
> The goal of MVCC in a DBMS is to allow multiple queries to read and write to the database simultaneously without interfering with each other when possible.
How is that a problem for most use cases?
If there is a read query which is taking a long time, with many rows, and some of these later rows happen to be updated mid-read but the earlier rows are not... it's not really a problem for the vast majority of applications. Why is it better for all rows to be delivered out of date versus just the first half fetched being out of date? It's not ideal in either case, but it's unavoidable that some requests can sometimes return out-of-date data. It seems like a tiny advantage.
I suspect the real need to implement MVCC arose out of the desire for databases like Postgres to implement atomic transactions as a magical black box.
IMO, two-phase commit is a simpler solution to this problem. It's not possible to fully hide concurrency complexity from the user; it ends up with tradeoffs.
Off topic, it was marketing all along: https://news.ycombinator.com/item?id=15124306
> It's not really a problem for the vast majority of applications.
This is true, but I don't even want to have to think about when it is indeed not really a problem versus the few cases when it is.
Not sure how PG implements it, but I tried it in a case where I did need it in SQLAnywhere, and only found out a bit too late that while the docs stated it was very detrimental to performance, the docs didn't explicitly say why, and it was much worse than I had assumed.
I assumed it meant the transaction would lock the table, do its thing, and release on commit/rollback. And of course, that would hurt performance a lot if there was high contention. But no, that's not what it did. It was much, much worse.
Instead of taking a lock on the whole table, it locked all the rows. Which went as swimmingly as you could expect on a table with thousands upon thousands of rows.
Not sure why they did it this way, but yeah, I had to ditch that and went with the good old retry loop.
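For reference, the "good old retry loop" pattern looks something like this (sketched in Python with psycopg2/Postgres rather than SQLAnywhere; the error classes and backoff are just one reasonable choice):

    import time
    import psycopg2
    from psycopg2 import errors

    def run_with_retries(conn, do_work, max_attempts=5):
        # Re-run the transaction when the database aborts it due to a
        # serialization failure or deadlock, instead of taking broad locks.
        for attempt in range(max_attempts):
            try:
                with conn:                  # commits on success, rolls back on error
                    with conn.cursor() as cur:
                        do_work(cur)
                return
            except (errors.SerializationFailure, errors.DeadlockDetected):
                time.sleep(0.05 * (2 ** attempt))   # simple exponential backoff
        raise RuntimeError("transaction kept failing after retries")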
“ In the 2000s, the conventional wisdom selected MySQL because rising tech stars like Google and Facebook were using it. Then in the 2010s, it was MongoDB because non-durable writes made it “webscale“. In the last five years, PostgreSQL has become the Internet’s darling DBMS. And for good reasons! It’s dependable, feature-rich, extensible, and well-suited for most operational workloads.”
Smart engineers were choosing Postgres not because of the logical fallacy of argumentum ad populum, but for the following reasons:
- Data safety (not MyISAM)
- ACID
- Similarity to Oracle
- MVCC
- SQL standards adherence
- Postgres team
- Helpful, awesome community
- Data types
- High performance
- BSD license flexibility
Above are the reasons I selected Postgres while at AT&T in the early 2000s, and our Oracle DBA found it a very easy transition. While MySQL went through rough transitions, PG has gone from strength to strength on an ever-improving path.
I think Bruce Momjian is a big part of this success; they truly have an excellent community. <3
While persisting key architectural ideas certainly has benefits, so does evolving their implementations.
Are you referring to C++? That was actually created by a Danish guy, who was also inspired by the object-oriented Simula language created in the 60s.
We solved the 2x storage with partitions, but it feels like the tail wagging the dog
For others who are curious:
> But please don’t misunderstand our diatribe to mean that we don’t think you should ever use PostgreSQL. Although its MVCC implementation is the wrong way to do it, PostgreSQL is still our favorite DBMS. To love something is to be willing to work with its flaws.
I wonder if CRDB (or other more recently designed DBs) has circumvented those issues? Or do we just not hear about those issues because CRDB and the other newer DBs are not that widely used and are mainly in the commercial space?
So from reading https://www.postgresql.org/docs/17/transaction-iso.html#XACT... we can tell that using serializable transactions only locks data actually used.
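A small illustration of what "only locks data actually used" means in practice (Python + psycopg2, with a hypothetical `accounts` table): two serializable transactions that read and write disjoint rows don't conflict with each other.

    import psycopg2
    from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

    dsn = "dbname=test"  # placeholder connection string

    t1 = psycopg2.connect(dsn)
    t2 = psycopg2.connect(dsn)
    t1.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)
    t2.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)

    c1, c2 = t1.cursor(), t2.cursor()
    # Each transaction touches only its own row, so there is no dangerous
    # dependency between them and both commit.
    c1.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
    c2.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")
    t1.commit()
    t2.commit()
    # If their read/write sets overlapped, one of them would instead be aborted
    # with a serialization error and would have to retry.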
It's also Apache 2.0 licensed.
You know, I was wondering something regarding this write amplification. It's true that MySQL doesn't need to update its indexes like that. However, MySQL replication relies on the binlog, where every change has to be written in addition to the database itself (InnoDB redo log and so on). So, it seems to me, MySQL, if used in a cluster, has a different kind of write amplification. One that PostgreSQL does not have, because it reuses its WAL for the replication.
In addition, on the receiving side, MySQL first writes the incoming binlog to the relay log. The relay log is then consumed by the applier threads, creating more InnoDB writes and (by default) more binlog.
The preference kept growing thanks to data safety, DDL in transactions, etc.
The problem is of course making changes offline that the user assumes are permanent, but which later, when sync time comes, turn out to conflict with changes made in the meantime. So changes can't be made permanent. Either that requires difficult UX to reconcile, or something that will always give you something consistent, like a CRDT.
[0]: https://ditto.live/
[0]: https://www.postgresql.org/docs/current/ddl-constraints.html
You might end up with even more than that due to filesystem metadata (inode records, checksums), metadata of an underlying RAID mechanism or, when working via some sort of networking, stuff like ethernet frame sizes/MTU.
In an ideal world, there would be a clear interface which a program can use to determine for any given combination of storage media, HW RAID, transport layer (local attach vs stuff like iSCSI or NFS), SW RAID (i.e. mdraid), filesystem and filesystem features what the most sensible minimum changeable unit is to avoid unnecessary write amplification bloat.
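You can get at some of the pieces today, e.g. the filesystem's preferred block size, but not the whole stack. A quick Python example (the path is just an illustration):

    import os

    st = os.statvfs("/var/lib/postgresql")   # any path on the filesystem of interest
    print("preferred I/O block size:", st.f_bsize)
    print("fundamental block size:", st.f_frsize)
    # This says nothing about the drive's physical sector size, RAID stripe
    # width, or iSCSI/NFS transfer sizes -- exactly the gap described above.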
When doing game development in the early 2000s I learned that bubble sort is not a loser in all regards. It performs well when a list is usually almost sorted. One situation when this is the case is in 3D rendering, which sorts objects by their distance from the camera. As you move the camera around or rotate it, bubble sort works very well for re-sorting the objects given the order they had in the previous frame.
To prevent bad worst-case scenarios you can count the number of comparisons that failed on the last pass and the number of passes you have performed so far, then switch to a different sort algorithm after reaching a threshold.
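A sketch of that adaptive approach in Python (the pass limit, swap threshold, and fallback sort are arbitrary choices):

    def adaptive_bubble_sort(items, max_passes=8, max_swaps_per_pass=32):
        # Fast when `items` is already nearly sorted (e.g. re-sorting objects by
        # camera distance between frames); falls back to a general-purpose sort
        # if the list turns out to be far from sorted.
        n = len(items)
        for _ in range(max_passes):
            swaps = 0
            for i in range(n - 1):
                if items[i] > items[i + 1]:
                    items[i], items[i + 1] = items[i + 1], items[i]
                    swaps += 1
            if swaps == 0:
                return items                  # fully sorted
            if swaps > max_swaps_per_pass:
                break                         # too unsorted: give up on bubbling
        items.sort()                          # fall back to the built-in sort
        return items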
Firebase and cognito/appsync work this way, basically.
You can use any data store you want on the server to do that. You could theoretically push a local sqlite db up to s3 as a sync mechanism, I suppose, if you do the locking correctly.
The article mentioned that there are nearly 900 different databases on the market. I am trying to build yet another one using a unique architecture I developed. It is very fast, and although it is designed for transactions, I haven't implemented them yet.
I think if I spend the time and effort to do it right, this could be a real game changer (I think the architecture lends itself very well to a superior implementation); but I don't want to waste too much time on it if people don't really care one way or the other.
Most source control systems have some common logical concepts (e.g. files and directories), but there's actually significant divergence between their logical models. For instance:
- Classic Perforce (as opposed to Perforce Streams) has a branching model that's very different from Git's; "branches" are basically just directories, and branching/merging is tracked on a per-file basis rather than a per-commit basis. It also tracks revisions by an incrementing ID rather than hashes.
- Darcs and Pijul represent the history of a file as an unordered set of patches; a "branch" is basically just a set of patches to apply to the file's initial (empty) state.
All of that is above the physical state, which also differs:
- Perforce servers track files' revision histories in a directory hierarchy that mirrors the repository's file structure rather than building a pseudo-directory hierarchy over a flat object store.
- Fossil stores everything in an SQLite database.
> Is that relevant to something?
Yes. You can use a VCS reasonably effectively if you understand its logical model but not its physical storage model. It doesn't work so well the other way around.
Long transactions can also cause surprising locks, because many locks taken persist to the end of the transaction, even if the transaction is no longer doing anything. This can block DDL operations as well as things like REINDEX.
Linus began work on it in April 1991: https://groups.google.com/g/comp.os.minix/c/dlNtH7RRrGA/m/_R...
hopefully performance has improved since
Many people consider this the most expensive bug in history: on the first flight of Ariane 5, the rocket entered a speed range that was strictly prohibited in the Ariane 4 software, which caused a software exception, and then about a billion dollars' worth of rocket crashed.
Honestly, they could have re-checked all the ranges, but they decided it would cost about as much as writing new software, so to save money the decision was made to just reuse the old software without additional checks.
In Postgres, even with the serializable isolation level, all transactions that touch the same rows must also be using the serializable isolation level or it's not really enforced. This is one aspect of serializable isolation in Postgres that seemed like a major gotcha for real world application development. There's no future proof solution: new code can be added that doesn't use the serializable isolation, and then the assumptions of isolation from the earlier code are no longer valid.
FOR UPDATE is the only real solution in my eyes
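e.g. the usual read-modify-write with an explicit row lock (Python + psycopg2, hypothetical `accounts` table) -- the lock is enforced regardless of what isolation level other transactions use:

    import psycopg2

    conn = psycopg2.connect("dbname=test")   # placeholder connection string

    with conn, conn.cursor() as cur:
        # FOR UPDATE locks the row until commit/rollback; any other transaction
        # that tries to UPDATE or SELECT ... FOR UPDATE the same row blocks,
        # whatever its isolation level.
        cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (1,))
        balance = cur.fetchone()[0]
        cur.execute("UPDATE accounts SET balance = %s WHERE id = %s", (balance - 10, 1))
    # leaving the `with conn:` block commits and releases the lock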