MVCC – the part of PostgreSQL we hate the most (2023)

1. nightfly ◴[20 Oct 24 18:26 UTC] No.41897421[source]▶

> MySQL and Oracle store a compact delta between the new and current versions (think of it like a git diff).

Doesn't git famously _not_ store diffs and instead follows the same storage pattern postgres uses here and stores the full new and old objects?

replies(6): >>41897457 #>>41897486 #>>41897759 #>>41897885 #>>41899164 #>>41899189 #

2. jmholla ◴[20 Oct 24 18:31 UTC] No.41897457[source]▶

>>41897421 (TP) #

That is correct. Each version of a file is a separate blob. There is some compression done by packing to make cloning faster, but the raw form git works with is these blobs.

replies(2): >>41897535 #>>41898446 #

3. ChadNauseam ◴[20 Oct 24 18:36 UTC] No.41897486[source]▶

>>41897421 (TP) #

TBF, the quoted section doesn't say that git stores diffs (or anything about git storage), it just says that what MySQL and Oracle store is similar to a git diff.

replies(2): >>41900208 #>>41906576 #

4. Hendrikto ◴[20 Oct 24 19:13 UTC] No.41897759[source]▶

>>41897421 (TP) #

Git diffs are generated on the fly, but diffs are still diffs.

5. simonw ◴[20 Oct 24 19:15 UTC] No.41897771{3}[source]▶

>>41897535 #

Saying "that's incorrect" is a lot more productive than saying "that's a lie".

Calling something a lie implies that the incorrect information was deliberate.

6. ori_b ◴[20 Oct 24 19:21 UTC] No.41897810{3}[source]▶

>>41897535 #

Git does both. When you create a commit, it stores a full (zipped) copy of the object, without any deltas.
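As a rough illustration of the "full (zipped) copy" point, here is a minimal Python sketch of how a loose blob is laid out: a `blob <size>\0` header followed by the entire file content, zlib-compressed, with the object ID being the hash of the uncompressed bytes. It assumes a repository using the default SHA-1 object format, and the helper names (`git_blob_id`, `encode_loose_object`, `decode_loose_object`) are made up for this example rather than taken from git.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def encode_loose_object(content: bytes) -> bytes:
    """Return zlib-compressed bytes of a loose blob: header plus the FULL content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return zlib.compress(header + content)


def decode_loose_object(raw: bytes) -> bytes:
    """Inverse of encode_loose_object: decompress, check and strip the header."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")
    kind, size = header.split(b" ")
    assert kind == b"blob" and int(size) == len(body)
    return body


# Two versions of the same file. Each is stored whole; neither refers to the other.
v1 = b"hello world\n"
v2 = b"hello world\nsecond line\n"
assert decode_loose_object(encode_loose_object(v1)) == v1
assert decode_loose_object(encode_loose_object(v2)) == v2

# The empty blob has a well-known ID in SHA-1 repositories.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

On disk, git places a loose object under `.git/objects/`, using the first two hex characters of the ID as a directory name and the remaining characters as the file name.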

Periodically (I believe it used to be every thousand commits, though I'm not sure what the heuristic is today), git will take the loose objects and compress them into a pack.

The full blob format is how objects are manipulated by git internally: to do anything useful, the objects need to be extracted from the pack, with all deltas applied, before anything can be done with them.

It's also worth noting that accessing a deltified object is slow (O(n) in the number of deltas), so the length of the delta chain is limited. Because deltification is really just a compression format, it doesn't matter how or where the deltas are done -- the trivial "no deltas" option will work just fine if you want to implement that.
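To make the O(n) cost concrete, here is a toy Python model of a delta chain. It is not git's on-disk pack format: the `Entry` class, the tuple-based copy/insert instructions, and the `max_depth` cap are all assumptions made for this sketch. The only point is that reading a deltified object means walking back to a full base and replaying every delta on the way, which is why a store would want to bound chain length.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    data: bytes = b""            # full content, used when base is None
    base: Optional[str] = None   # key of the base entry, for deltified entries
    ops: Tuple[tuple, ...] = ()  # toy delta: ("copy", offset, length) or ("insert", bytes)


def apply_delta(base: bytes, ops: Tuple[tuple, ...]) -> bytes:
    """Rebuild an object by copying ranges from its base and inserting new bytes."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:  # "insert"
            out += op[1]
    return bytes(out)


def read_object(store: Dict[str, Entry], key: str, max_depth: int = 50) -> Tuple[bytes, int]:
    """Return (content, chain_length); work grows with the number of deltas in the chain."""
    chain = []
    entry = store[key]
    while entry.base is not None:  # walk back until a full object is found
        if len(chain) >= max_depth:  # arbitrary cap chosen for this sketch
            raise ValueError("delta chain too long")
        chain.append(entry.ops)
        entry = store[entry.base]
    content = entry.data
    for ops in reversed(chain):  # replay deltas from the base outward
        content = apply_delta(content, ops)
    return content, len(chain)


store = {
    "a": Entry(data=b"hello world\n"),
    "b": Entry(base="a", ops=(("copy", 0, 12), ("insert", b"second line\n"))),
    "c": Entry(base="b", ops=(("copy", 0, 6), ("insert", b"there\n"), ("copy", 12, 12))),
}

content, depth = read_object(store, "c")
print(depth, content)
```

Callers of `read_object` only ever see whole contents, so swapping every entry for a full, delta-free one would change nothing above the storage layer.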

You can trivially verify this by creating commits and looking in '.git/objects/*' for loose objects, running 'git repack', and then looking in '.git/objects/pack' for the deltified packs.

7. paulddraper ◴[20 Oct 24 19:32 UTC] No.41897885[source]▶

>>41897421 (TP) #

1. The comparison was to MySQL and Oracle storage using git diff format as an analogy, not git storage.

2. git storage does compress, and the compression is "diff-based" of sorts, but it is not based on commit history as one might naively expect.

8. haradion ◴[20 Oct 24 19:32 UTC] No.41897887{3}[source]▶

>>41897535 #

The file contents are logically distinct blobs. Packfiles will aggregate and delta-compress similar blobs, but that's all at a lower level than the logical model.

replies(1): >>41902053 #

9. quotemstr ◴[20 Oct 24 21:04 UTC] No.41898446[source]▶

>>41897457 #

git's model is a good example of layered architecture. Most of the code works in terms of whole blobs. The blob storage system, as an implementation detail, stores some blobs as diffs. The use of diffs doesn't leak into the rest of the system. Good separation of concerns.

10. arp242 ◴[20 Oct 24 22:21 UTC] No.41898972{3}[source]▶

>>41897535 #

Sjeez, tone it down. People can be incorrect without lying.

11. ◴[20 Oct 24 22:56 UTC] No.41899164[source]▶

>>41897421 (TP) #

12. epcoa ◴[20 Oct 24 23:00 UTC] No.41899189[source]▶

>>41897421 (TP) #

Others have mentioned that it said “git diffs”. However git does use deltas in pack files as a low level optimization, similar to the MySQL comparison. You don’t get back diffs from a SQL query either.

13. zdragnar ◴[21 Oct 24 02:34 UTC] No.41900208[source]▶

>>41897486 #

It's a little too easy to misinterpret if you're skimming and still have memories of working with SVN, mercurial, perforce, and probably others (I've intentionally repressed everything about tfvc).

14. thaumasiotes ◴[21 Oct 24 08:57 UTC] No.41902053{4}[source]▶

>>41897887 #

Is that relevant to something? The logical model is identical for every source control system. Deltas are a form of compression for storage in every source control system.

replies(1): >>41904694 #

15. haradion ◴[21 Oct 24 14:41 UTC] No.41904694{5}[source]▶

>>41902053 #

> The logical model is identical for every source control system.

Most source control systems have some common logical concepts (e.g. files and directories), but there's actually significant divergence between their logical models. For instance:

- Classic Perforce (as opposed to Perforce Streams) has a branching model that's very different from Git's; "branches" are basically just directories, and branching/merging is tracked on a per-file basis rather than a per-commit basis. It also tracks revisions by an incrementing ID rather than hashes.
- Darcs and Pijul represent the history of a file as an unordered set of patches; a "branch" is basically just a set of patches to apply to the file's initial (empty) state.

All of that is above the physical state, which also differs:

- Perforce servers track files' revision histories in a directory hierarchy that mirrors the repository's file structure rather than building a pseudo-directory hierarchy over a flat object store.
- Fossil stores everything in an SQLite database.

> Is that relevant to something?

Yes. You can use a VCS reasonably effectively if you understand its logical model but not its physical storage model. It doesn't work so well the other way around.

16. layer8 ◴[21 Oct 24 17:55 UTC] No.41906576[source]▶

>>41897486 #

It’s not clear why they state “git diff” specifically. It’s simply a diff (git or otherwise).