I wish the PostgreSQL community would stop chasing more frontend features and spend a concerted few years completely renovating their storage layer. The effort in each release seems massively and disproportionately skewed towards frontend improvements without the will to address these fundamental issues.
It's absurd that in 2024, "the world's most advanced open source database" doesn't have a method of doing upgrades between major versions that doesn't involve taking the database down.
Yes, logical replication exists, but it still doesn't do DDL, so it has big caveats attached.
For large instances this is a big ask, especially of a project without a single person in charge. MySQL does have better replication, yet it still often requires manually setting up replication and cutting it over to do major version upgrades.
In practice, the only way to change the fundamental architecture of a database is to write a new one, with everything that entails.
It seems like one of the biggest fundamental flaws is that Postgres chose the O2N approach for tracking row versions instead of N2O. While switching to N2O wouldn't solve all problems (e.g. the article also talks about how Postgres stores full row copies and not just diffs), from an "80/20 rule" perspective, it seems like it would get rid of most of the downsides of the current implementation. For example, I'd assume that the vast majority of the time transactions want the latest row version, so using the N2O ordering means you could probably do away with storing each row version in an index, as you'd only need to traverse the linked list if you needed an older version, which should be much less common.
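For what it's worth, a toy Python sketch of the two orderings (not Postgres's actual on-disk layout, just the idea): with O2N the common "give me the latest version" lookup pays for the chain length, while with N2O it's the rare "give me an old version" lookup that walks the chain.

    # Toy model of a version chain; `data` is a full copy of the tuple for that
    # version, `next` points to the adjacent version in the chain.
    class Version:
        def __init__(self, data, next_version=None):
            self.data = data
            self.next = next_version

    # O2N (roughly what Postgres does): the entry point is the oldest version,
    # so finding the latest visible version means walking the whole chain.
    def latest_o2n(head):
        v = head
        while v.next is not None:
            v = v.next
        return v.data

    # N2O: the entry point is the newest version; the common case is O(1) and
    # only readers needing an older snapshot walk the chain.
    def latest_n2o(head):
        return head.data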
The Part of PostgreSQL We Hate the Most (2023)
Doesn't git famously _not_ store diffs, and instead follow the same storage pattern Postgres uses here, storing the full new and old objects?
Every application is a bit different, but it's not that the PostgreSQL design is a loser in all regards. It's not like bubble sort.
the article literally says that pg's mvcc design is from the 90s and no one does it like that any more. that is technology that is outdated by over 30 years. i'd say it does not make it a loser in all regards, but in the most important aspects.
Periodically (I believe it used to be every thousand commits, though I'm not sure what the heuristic is today), git will take the loose objects and compress them into a pack.
The full object format is how git manipulates objects internally: to do anything useful with a deltified object, it first has to be reconstructed in full, with all deltas applied, before anything can be done with it.
It's also worth noting that accessing a deltified object is slow (O(n) in the number of deltas), so the length of the delta chain is limited. Because deltification is really just a compression format, it doesn't matter how or where the deltas are done -- the trivial "no deltas" option will work just fine if you want to implement that.
You can trivially verify this by creating commits and looking in '.git/objects/*' for loose objects, running 'git repack', and then looking in '.git/objects/pack' for the deltified packs.
2. git storage does compress, and the compression is "diff-based" of sorts, but it is not based on commit history as one might naively expect.
Though full tuples are pretty fundamental to the underlying implementation... MVCC, VACUUM, etc. It'd be a massive change, to say the least.
I’d prefer not to be the first person running up against a limit or discovering a bug in my DB software.
So, newer doesn't always mean better, just saying.
An interesting behavior of MySQL that I have observed (~500GB database, with a schema that is more document-oriented than relational) is that when you update a single row, doing SELECT id WHERE something; UPDATE what WHERE id=id is orders of magnitude faster than UPDATE what WHERE something. I somehow suspect that this is the reason for this behavior. But the normal workload will not do that, and this only slows down ad-hoc DML when you fix some inconsistency.
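For anyone who wants to reproduce the comparison, a rough sketch of the two access patterns (Python with a generic DB-API connection `conn`; the `docs`/`what`/`something` names are made up to match the pseudo-SQL above):

    # Assumes `conn` is a PEP 249 (DB-API) connection to the MySQL database and
    # that `docs(id, what, something)` is a stand-in for the real schema.

    def update_direct(conn, value, key):
        # One statement: the UPDATE has to locate the row via `something` itself.
        cur = conn.cursor()
        cur.execute("UPDATE docs SET what = %s WHERE something = %s", (value, key))
        conn.commit()

    def update_via_id(conn, value, key):
        # Two statements: a cheap SELECT to find the primary key, then an UPDATE
        # keyed only on id -- observed to be orders of magnitude faster here.
        cur = conn.cursor()
        cur.execute("SELECT id FROM docs WHERE something = %s", (key,))
        row = cur.fetchone()
        if row is not None:
            cur.execute("UPDATE docs SET what = %s WHERE id = %s", (value, row[0]))
        conn.commit()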
Oriole has joined us at supabase now and it’s being worked on full time by Alexander and his team. Here is the patch set:
https://www.orioledb.com/docs#patch-set
It will be available to try on the supabase platform later this year too
Flat files have also been reliably used in production for decades. That doesn't mean they're ideal...although amusingly enough s3 and its equivalent of flat files is what we've migrated to as a data store.
> The goal of MVCC in a DBMS is to allow multiple queries to read and write to the database simultaneously without interfering with each other when possible.
How is that a problem for most use cases?
If there is a read query which is taking a long time, with many rows, and some of these later rows happen to be updated mid-read but the earlier rows are not... it's not really a problem for the vast majority of applications. Why is it better for all rows to be delivered out of date versus just the first half fetched being out of date? It's not ideal in either case, but it's unavoidable that some requests can sometimes return out-of-date data. It seems like a tiny advantage.
I suspect the real need to implement MVCC arose out of the desire for databases like Postgres to implement atomic transactions as a magical black box.
IMO, two-phase commit is a simpler solution to this problem. It's not possible to fully hide concurrency complexity from the user; it ends up with tradeoffs.
Off topic, it was marketing all along: https://news.ycombinator.com/item?id=15124306
> It's not really a problem for the vast majority of applications.
This is true, but I don't even want to have to think about when it is indeed not really a problem versus the few cases when it is.
Not sure how PG implements it, but I tried it in a case where I did need it in SQLAnywhere, and only found out a bit too late that while the docs stated it was very detrimental to performance, the docs didn't explicitly say why, and it was much worse than I had assumed.
I assumed it meant the transaction would lock the table, do its thing, and release on commit/rollback. And of course, that would hurt performance a lot if there was high contention. But no, that's not what it did. It was much, much worse.
Instead of taking a lock on the whole table, it locked all the rows. Which went as swimmingly as you could expect on a table with thousands upon thousands of rows.
Not sure why they did it this way, but yeah, I had to ditch that and went with the good old retry loop.
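For reference, the "good old retry loop" pattern looks something like this (sketched in Python with psycopg2/Postgres rather than SQLAnywhere; the error classes and backoff are just one reasonable choice):

    import time
    import psycopg2
    from psycopg2 import errors

    def run_with_retries(conn, do_work, max_attempts=5):
        # Re-run the transaction when the database aborts it due to a
        # serialization failure or deadlock, instead of taking broad locks.
        for attempt in range(max_attempts):
            try:
                with conn:                  # commits on success, rolls back on error
                    with conn.cursor() as cur:
                        do_work(cur)
                return
            except (errors.SerializationFailure, errors.DeadlockDetected):
                time.sleep(0.05 * (2 ** attempt))   # simple exponential backoff
        raise RuntimeError("transaction kept failing after retries")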
“ In the 2000s, the conventional wisdom selected MySQL because rising tech stars like Google and Facebook were using it. Then in the 2010s, it was MongoDB because non-durable writes made it “webscale“. In the last five years, PostgreSQL has become the Internet’s darling DBMS. And for good reasons! It’s dependable, feature-rich, extensible, and well-suited for most operational workloads.”
Smart engineers were choosing Postgres not because of the logical fallacy of argumentum ad populum, but for the following reasons:
- Data safety (not MyISAM)
- ACID
- Similarity to Oracle
- MVCC
- SQL standards adherence
- Postgres team
- Helpful, awesome community
- Data types
- High performance
- BSD license flexibility
Above are the reasons I selected Postgres while at AT&T in the early 2000s, and our Oracle DBA found it a very easy transition. While MySQL went through rough transitions, PG has gone from strength to strength on an ever-improving path.
I think Bruce Momjian is a big part of this success; they truly have an excellent community. <3
While persisting key architectural ideas certainly has benefits, so does evolving their implementations.
Are you referring to C++? That was actually created by a Danish guy, who was also inspired by the object-oriented Simula language created in the 60s.
We solved the 2x storage with partitions, but it feels like the tail wagging the dog
For others who are curious:
> But please don’t misunderstand our diatribe to mean that we don’t think you should ever use PostgreSQL. Although its MVCC implementation is the wrong way to do it, PostgreSQL is still our favorite DBMS. To love something is to be willing to work with its flaws.
I wonder if CRDB (or other more recently designed DBs) has circumvented those issues? Or do we just not hear about those issues because CRDB and the other newer DBs are not that widely used and are mainly in the commercial space?
So from reading https://www.postgresql.org/docs/17/transaction-iso.html#XACT... we can tell that using serializable transactions only locks data actually used.
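A small illustration of what "only locks data actually used" means in practice (Python + psycopg2, with a hypothetical `accounts` table): two serializable transactions that read and write disjoint rows don't conflict with each other.

    import psycopg2
    from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

    dsn = "dbname=test"  # placeholder connection string

    t1 = psycopg2.connect(dsn)
    t2 = psycopg2.connect(dsn)
    t1.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)
    t2.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)

    c1, c2 = t1.cursor(), t2.cursor()
    # Each transaction touches only its own row, so there is no dangerous
    # dependency between them and both commit.
    c1.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
    c2.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")
    t1.commit()
    t2.commit()
    # If their read/write sets overlapped, one of them would instead be aborted
    # with a serialization error and would have to retry.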
It's also Apache 2.0 licensed.
You know, I was wondering something regarding this write amplification. It's true that MySQL doesn't need to update its indexes like that. However, MySQL replication relies on the binlog, where every change has to be written in addition to the database itself (InnoDB redo log and so on). So, it seems to me, MySQL, if used in a cluster, has a different kind of write amplification. One that PostgreSQL does not have, because it reuses its WAL for the replication.
In addition, on the receiving side, MySQL first writes the incoming binlog to the relay log. The relay log is then consumed by the applier threads, creating more InnoDB writes and (by default) more binlog.
The preference kept growing thanks to data safety, DDL in transactions, etc.
The problem is of course making changes offline that the user assumes are permanent, but which later, when sync time comes, turn out to conflict with changes made in the meantime. So changes can't be made permanent. Either that requires difficult UX to reconcile, or something that will always give you something consistent, like a CRDT.
[0]: https://ditto.live/
[0]: https://www.postgresql.org/docs/current/ddl-constraints.html
You might end up with even more than that due to filesystem metadata (inode records, checksums), metadata of an underlying RAID mechanism or, when working via some sort of networking, stuff like ethernet frame sizes/MTU.
In an ideal world, there would be a clear interface which a program can use to determine for any given combination of storage media, HW RAID, transport layer (local attach vs stuff like iSCSI or NFS), SW RAID (i.e. mdraid), filesystem and filesystem features what the most sensible minimum changeable unit is to avoid unnecessary write amplification bloat.
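You can get at some of the pieces today, e.g. the filesystem's preferred block size, but not the whole stack. A quick Python example (the path is just an illustration):

    import os

    st = os.statvfs("/var/lib/postgresql")   # any path on the filesystem of interest
    print("preferred I/O block size:", st.f_bsize)
    print("fundamental block size:", st.f_frsize)
    # This says nothing about the drive's physical sector size, RAID stripe
    # width, or iSCSI/NFS transfer sizes -- exactly the gap described above.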
When doing game development in the early 2000s I learned that bubble sort is not a loser in all regards. It performs well when a list is usually almost sorted. One situation when this is the case is in 3D rendering, which sorts objects by their distance from the camera. As you move the camera around or rotate it, bubble sort works very well for re-sorting the objects given the order they had in the previous frame.
To prevent bad worst-case scenarios you can count the number of comparisons that failed on the last pass and the number of passes you have performed so far, then switch to a different sort algorithm after reaching a threshold.
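A sketch of that adaptive approach in Python (the pass limit, swap threshold, and fallback sort are arbitrary choices):

    def adaptive_bubble_sort(items, max_passes=8, max_swaps_per_pass=32):
        # Fast when `items` is already nearly sorted (e.g. re-sorting objects by
        # camera distance between frames); falls back to a general-purpose sort
        # if the list turns out to be far from sorted.
        n = len(items)
        for _ in range(max_passes):
            swaps = 0
            for i in range(n - 1):
                if items[i] > items[i + 1]:
                    items[i], items[i + 1] = items[i + 1], items[i]
                    swaps += 1
            if swaps == 0:
                return items                  # fully sorted
            if swaps > max_swaps_per_pass:
                break                         # too unsorted: give up on bubbling
        items.sort()                          # fall back to the built-in sort
        return items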
Firebase and cognito/appsync work this way, basically.
You can use any data store you want on the server to do that. You could theoretically push a local sqlite db up to s3 as a sync mechanism, I suppose, if you do the locking correctly.
The article mentioned that there are nearly 900 different databases on the market. I am trying to build yet another one using a unique architecture I developed. It is very fast, and although it is designed for transactions, I haven't implemented them yet.
I think if I spend the time and effort to do it right, this could be a real game changer (I think the architecture lends itself very well to a superior implementation); but I don't want to waste too much time on it if people don't really care one way or the other.
Most source control systems have some common logical concepts (e.g. files and directories), but there's actually significant divergence between their logical models. For instance:
- Classic Perforce (as opposed to Perforce Streams) has a branching model that's very different from Git's; "branches" are basically just directories, and branching/merging is tracked on a per-file basis rather than a per-commit basis. It also tracks revisions by an incrementing ID rather than hashes.
- Darcs and Pijul represent the history of a file as an unordered set of patches; a "branch" is basically just a set of patches to apply to the file's initial (empty) state.
All of that is above the physical state, which also differs:
- Perforce servers track files' revision histories in a directory hierarchy that mirrors the repository's file structure rather than building a pseudo-directory hierarchy over a flat object store.
- Fossil stores everything in an SQLite database.
> Is that relevant to something?
Yes. You can use a VCS reasonably effectively if you understand its logical model but not its physical storage model. It doesn't work so well the other way around.
Long transactions can also cause surprising locks, because many locks taken persist to the end of the transaction, even if the transaction is no longer doing anything. This can block DDL operations as well as things like REINDEX.
Linus began work on it in April 1991: https://groups.google.com/g/comp.os.minix/c/dlNtH7RRrGA/m/_R...
hopefully performance has improved since
Many people consider this the most expensive bug in history: on the first flight of Ariane 5, the rocket entered a speed range that was strictly prohibited in the Ariane 4 software, which caused a software exception, and then about a billion dollars' worth of rocket crashed.
Honestly, they could have re-checked all the ranges, but they decided it would cost about as much as writing new software, so to save money the decision was made to just reuse the old software without additional checks.
In Postgres, even with the serializable isolation level, all transactions that touch the same rows must also be using the serializable isolation level or it's not really enforced. This is one aspect of serializable isolation in Postgres that seemed like a major gotcha for real world application development. There's no future proof solution: new code can be added that doesn't use the serializable isolation, and then the assumptions of isolation from the earlier code are no longer valid.
FOR UPDATE is the only real solution in my eyes
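e.g. the usual read-modify-write with an explicit row lock (Python + psycopg2, hypothetical `accounts` table) -- the lock is enforced regardless of what isolation level other transactions use:

    import psycopg2

    conn = psycopg2.connect("dbname=test")   # placeholder connection string

    with conn, conn.cursor() as cur:
        # FOR UPDATE locks the row until commit/rollback; any other transaction
        # that tries to UPDATE or SELECT ... FOR UPDATE the same row blocks,
        # whatever its isolation level.
        cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (1,))
        balance = cur.fetchone()[0]
        cur.execute("UPDATE accounts SET balance = %s WHERE id = %s", (balance - 10, 1))
    # leaving the `with conn:` block commits and releases the lock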