
752 points crazypython | 1 comments
Ericson2314 ◴[] No.26371316[source]
What people usually miss about these things is that normal version control benefits hugely from content addressing and normal forms.

The salient aspect of relational data is that it's cyclic. This makes content addressing unable to provide normal forms on its own (unless someone figures out how to Merkle cyclic graphs!), but the normal form can still be made other ways.

The first part is easy enough: store rows in some order.
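
A minimal sketch of that first part, assuming rows are plain lists of strings and that SHA-256 over sorted, JSON-serialized rows is an acceptable content address. The name `table_digest` is hypothetical, not taken from any "git for data" tool. It only removes the dependence on physical row order; the surrogate key values still feed into the hash.

```python
import hashlib
import json


def table_digest(rows):
    """Hash a table so that physical row order does not affect the result."""
    # Serialize each row deterministically, then sort the serialized rows.
    encoded = sorted(json.dumps(row, separators=(",", ":")) for row in rows)
    h = hashlib.sha256()
    for line in encoded:
        h.update(line.encode("utf-8"))
        h.update(b"\n")  # row separator; JSON escapes newlines inside values
    return h.hexdigest()


# The same rows in a different order give the same digest.
a = [["0", "Alice"], ["1", "Bob"]]
b = [["1", "Bob"], ["0", "Alice"]]
assert table_digest(a) == table_digest(b)
```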

The second part is more interesting: making the choice of surrogate keys not matter (quotienting it away). Sorting table rows containing surrogate keys depending on the sorting of table rows makes for some interesting bags of constraints, for which there may be more than one fixed point.

Example:

  CREATE TABLE Foo (
    a uuid PRIMARY KEY,
    b text,
    best_friend uuid REFERENCES Foo(a)
  );
DB 0:

  0 Alice 0
1 reclusive Alice, best friends with herself. Just fine.

  0 Alice 1
  1 Alice 1
2 reclusive Alices, both best friends with the second one. The Alices are the same up to primary keys, but while primary keys are to be quotiented out, primary key equality isn't, so this is valid. And we have an asymmetry by which to sort.

  0 Alice 1
  1 Alice 0
2 reclusive Alices, each best friends with the other. The Alices are completely isomorphic, and one notion of normal forms would say this is exactly the same as DB 0: as if this is reclusive Alice in a fun house of mirrors.
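
One way to make the three cases concrete is a sketch using iterated color refinement, which is a swapped-in technique here and not necessarily the analysis meant above. Assumptions: rows are `(key, name, best_friend_key)` tuples, `best_friend` is never null, and the names `row_colors` and `canonical_form` are hypothetical. Keys are only ever used for lookup and for the "is this my own key" test, so renaming them consistently cannot change the result. Color refinement is not a complete isomorphism test in general, so treat this as an illustration only.

```python
import hashlib
import json


def _digest(*parts):
    # Fixed-size color from a tuple of strings and booleans.
    return hashlib.sha256(json.dumps(parts).encode("utf-8")).hexdigest()


def row_colors(rows):
    """Map each key to a color that ignores the key's actual value."""
    friend = {key: bf for key, _, bf in rows}
    color = {key: _digest(name) for key, name, _ in rows}
    # A partition of n rows can be refined at most n - 1 times.
    for _ in range(len(rows)):
        color = {
            key: _digest(color[key], friend[key] == key, color[friend[key]])
            for key in color
        }
    return color


def canonical_form(rows):
    """Sorted multiset of row colors, independent of key choice and row order."""
    return sorted(row_colors(rows).values())


db1 = [(0, "Alice", 1), (1, "Alice", 1)]
db2 = [(0, "Alice", 1), (1, "Alice", 0)]

c1, c2 = row_colors(db1), row_colors(db2)
assert c1[0] != c1[1]  # asymmetry: the rows can be sorted
assert c2[0] == c2[1]  # fully symmetric: nothing to sort by
```

Under this particular notion the mutual-friends database stays distinct from the single self-befriending Alice, because the multiset keeps both rows and the self-reference flag differs. A notion that also quotients out multiplicity and self-reference would have to collapse them.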

All this is resolvable, but it's subtle. And there's no avoiding complexity. E.g. if one wants to cross reference two human data entries who each assigned their own surrogate IDs, this type of analysis must be done. Likewise when merging forks of a database.

I'd love to be wrong, but I don't think any of the folks doing "git for data" are planning their technology with this level of mathematical rigor.

replies(4): >>26371380 #>>26371595 #>>26371670 #>>26372543 #
1. ghusbands ◴[] No.26371670[source]
> The salient aspect of relational data is that it's cyclic

This is an odd claim. Most relational data is not cyclic, and it's easy enough to come up with a scheme to handle cyclic data in a consistent fashion.

Conflicting changes (two changes to the same 'cell' of a database table) are a much more likely issue to hit and will need handling in much the same way merge conflicts are currently handled, so there are already situations in which manual effort will be needed.
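
A rough sketch of what cell-level conflict detection could look like, as a three-way merge against a common ancestor. Assumptions: each table is a dict of primary key to a dict of column to value, all three versions have the same rows and columns (no inserts or deletes), and `merge_cells` is a hypothetical name, not an API of any existing tool.

```python
def merge_cells(base, ours, theirs):
    """Three-way merge per cell; returns (merged table, list of conflicts)."""
    merged, conflicts = {}, []
    for key, base_row in base.items():
        merged[key] = {}
        for col, base_val in base_row.items():
            a, b = ours[key][col], theirs[key][col]
            if a == b or b == base_val:
                merged[key][col] = a  # same change, or only we changed it
            elif a == base_val:
                merged[key][col] = b  # only they changed it
            else:
                merged[key][col] = base_val  # keep base, leave it to a human
                conflicts.append((key, col, a, b))
    return merged, conflicts


base = {1: {"name": "Alice", "city": "Paris"}}
ours = {1: {"name": "Alicia", "city": "Paris"}}
theirs = {1: {"name": "Alice", "city": "Rome"}}

# Edits to different cells of the same row merge without a conflict.
merged, conflicts = merge_cells(base, ours, theirs)
assert merged == {1: {"name": "Alicia", "city": "Rome"}}
assert conflicts == []
```

Two different edits to the same cell end up in `conflicts`, which is the case that still needs manual handling.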