Dolt is Git for Data

(github.com)

334 points gjvc | 2 comments | 23 Jun 22 11:04 UTC | HN request time: 0.543s | source

Show context

remram ◴[23 Jun 22 15:33 UTC] No.31850562[source]▶

What kind of merge actually happens, e.g. how does it deal with conflicts? Does it merge at the SQL command level or at the changed tuple level?

If I have

    name   job age
    remram dev 32

and I concurrently do

    UPDATE table SET age = age + 2 WHERE job = 'dev';
    UPDATE table SET job = 'mgr' WHERE age = 32;

Do I get a conflict? Do I get something consistent with the two transactions serialized, e.g. dev 34 or op 32)? Can I get something that no ordering of transaction could have given me, like mgr 34?

replies(1): >>31850773 #

zachmu ◴[23 Jun 22 15:47 UTC] No.31850773[source]▶

>>31850562 #

Merge is cell-wise. If the two heads being merged edited disjoint cells, there's no conflict, they merge together. If they touched the same cell, or if one branch deleted a row another added, then you get a conflict.

Merge conflicts are stored in a special system table so you can decide what to do with them, kind of analogous to conflict markers in a source file. More details here:

https://docs.dolthub.com/sql-reference/version-control/merge...

The situation you're talking about with two transactions isn't a merge operation, it's just normal database transaction isolation level stuff. Dolt supports REPEATABLE_READ right now, with others coming in future releases. So in the example above, whichever transaction committed last would fail and get rolled back (assuming they touched the same rows).

replies(1): >>31852404 #

remram ◴[23 Jun 22 17:28 UTC] No.31852404[source]▶

>>31850773 #

> The situation you're talking about with two transactions isn't a merge operation, it's just normal database transaction isolation level stuff

I mean, arguably. It's not like there is a standard definition for "merge operation" on data. Even Git tries to do more than line-level changes, taking context into account, and turning one line change into two line changes if merging across a file copy for example.

Dolt markets itself as "a version controlled SQL database", so I think it is perfectly reasonable to consider the standard that already exists for concurrent changes to a SQL database, and that's transaction isolation.

I guess anything more complex than this would be pretty unwieldy though, with probably little benefits. I am struggling to come up with a good example for the kind of anomaly I imagine.

replies(1): >>31853137 #

zachmu ◴[23 Jun 22 18:19 UTC] No.31853137[source]▶

>>31852404 #

You're thinking too small. The transaction is generally not the unit you want to apply version control to, databases already have robust concurrency support at that level.

What you want is to have branches on your data that have many transactions applied to them, for days or weeks. Then you merge the branch back to main when it's ready to ship.

An early customer built pull requests for their configuration application using Dolt, read more here:

https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/

replies(1): >>31854351 #

1. remram ◴[23 Jun 22 19:53 UTC] No.31854351[source]▶

>>31853137 #

Please don't tell me what I'm thinking or what I want to do ;-)

The use case I had in mind is closer to a data.world or dbhub, collaborative data cleaning (which often include bulk operations) rather than merging OLTP databases after days or weeks of changes.

What I would use now is a script or Jupyter Notebook checked in Git, where cells are strongly ordered and if someone sends me a pull request (changing the code) I have to re-run the notebook to obtain the "merged dataset". I can't say that I have a use for Dolt but it is definitely cool tech.

replies(1): >>31854598 #

2. zachmu ◴[23 Jun 22 20:15 UTC] No.31854598[source]▶

>>31854351 (TP) #

If you're a data cleaning guy, you might be interested in our data bounties program. It's collaborative data importing / cleaning. The current one is building a catalog of museum collections. We launch a new one every month.

https://www.dolthub.com/blog/2022-06-14-announcing-museums-b...

↑