Dolt is Git for data

(www.dolthub.com)
358 points timsehn | 1 comments
sytse ◴[] No.22734084[source]
Very cool! The world needs better version control for data.

How does this compare to something like Pachyderm?

How does it work under the covers? What is a splice and what does it mean when it overlaps? https://github.com/liquidata-inc/dolt/blob/84d9eded517167eb2...

Is it feasible to use Conflict-free Replicated Data Types (CRDT) for this?

replies(2): >>22734407 #>>22735544 #
timsehn ◴[] No.22734407[source]
Here is an earlier blog post we published comparing Dolt to Pachyderm: https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-...

We've got a blog post on the storage system coming on Wednesday. It's a mashup of a Merkle DAG and a B-tree called a Prolly Tree. It comes from an open source package called Noms (https://github.com/attic-labs/noms).

I'm not familiar with CRDT. Will read up on that.

replies(2): >>22734605 #>>22734652 #
jdoliner ◴[] No.22734605[source]
Weighing in as Pachyderm founder.

The post Tim links here is a very apt description of what Pachyderm does. We're designed for version controlling data pipelines, as well as the data they input and output. Pachyderm's filesystem, pfs, is the component that's most similar to Dolt. Pfs is a filesystem, rather than a database, so it tends to be used for bigger data formats like videos, genomics files, and sometimes database dumps. And the main reason people do that is so they can run pipelines on top of those data files.

Under the hood the data structures are actually very similar, though: we use a Merkle Tree rather than a DAG, but the overall algorithm is very similar. Dolt, I think, is a great approach to version controlling SQL-style data and access. Noms was a really cool idea that didn't seem to quite find its groove, whereas Dolt seems to have taken the algorithm and made it into more of a tool with practical uses.

replies(1): >>22735198 #
visarga ◴[] No.22735198[source]
How does Pachyderm deal with GDPR requests? Is it possible to remove a file not just from the present but also from the history? It would be no use to delete a file on a GDPR request from the current version while still keeping it around in past commits.
replies(1): >>22741527 #
jdoliner ◴[] No.22741527[source]
Requests to purge data are one aspect of the GDPR that Pachyderm makes trickier. It makes it easier to remove a piece of data and recompute all of your models without it, because it can deduplicate the computation. But to truly purge a piece of data, deduplication becomes a hindrance, because the data can be referenced by previous commits, and even by other users' data. You can delete a piece of data and have it not be truly purged.
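
To make the deduplication problem concrete, here is a minimal sketch of a content-addressed store with commit history. It is not Pachyderm's implementation: `ContentStore`, its methods, and the file paths are hypothetical names made up for illustration. It only shows why removing a file from the latest commit leaves the bytes reachable.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (hypothetical): identical bytes are stored once."""

    def __init__(self):
        self.blobs = {}    # sha256 hex digest -> bytes
        self.commits = []  # each commit maps path -> digest

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # deduplicate on content
        return digest

    def commit(self, files: dict) -> int:
        """Record a snapshot of {path: bytes} and return the commit index."""
        self.commits.append({path: self.put(data) for path, data in files.items()})
        return len(self.commits) - 1

    def read(self, commit_id: int, path: str) -> bytes:
        return self.blobs[self.commits[commit_id][path]]


store = ContentStore()
record = b"name=Alice;email=alice@example.com"

# Two users upload identical bytes; the store keeps a single blob.
c0 = store.commit({"user_a/profile": record, "user_b/profile": record})

# User A's file is "deleted" by committing a snapshot without it.
c1 = store.commit({"user_b/profile": record})

print(len(store.blobs))                            # 1
print("user_a/profile" in store.commits[c1])       # False
print(store.read(c0, "user_a/profile") == record)  # True: still in history
```

Even if you rewrote history to drop the first commit, the blob would still be referenced by user B's file, so it could not simply be garbage collected.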

The best recommendation we have for that is that each user's data should be encrypted with a key that's unique to that user, and when that user asks you to purge their data you should throw away the key. That means that even if two users have the same data it will be stored encrypted under different keys, so if one asks for the data to be purged the other can still keep their data.
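
A rough sketch of that per-user-key idea, assuming a simple in-memory store. `KeyedStore`, `keystream_xor`, and the user names are hypothetical, and the cipher is a toy SHA-256 keystream used only to keep the example dependency-free; a real system should use a vetted authenticated cipher such as AES-GCM from a proper crypto library.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher for illustration only. NOT secure for production use."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


class KeyedStore:
    """Hypothetical store that encrypts each user's data under that user's key."""

    def __init__(self):
        self.keys = {}   # user -> 32-byte key
        self.blobs = {}  # digest of stored bytes -> (nonce, ciphertext)

    def put(self, user: str, data: bytes) -> str:
        key = self.keys.setdefault(user, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        ciphertext = keystream_xor(key, nonce, data)
        digest = hashlib.sha256(nonce + ciphertext).hexdigest()
        self.blobs[digest] = (nonce, ciphertext)
        return digest

    def get(self, user: str, digest: str) -> bytes:
        key = self.keys[user]  # raises KeyError once the key is thrown away
        nonce, ciphertext = self.blobs[digest]
        return keystream_xor(key, nonce, ciphertext)

    def purge_user(self, user: str) -> None:
        del self.keys[user]  # ciphertext stays in history but is unreadable


store = KeyedStore()
record = b"name=Alice;email=alice@example.com"
a_ref = store.put("alice", record)
b_ref = store.put("bob", record)  # same plaintext, different key and ciphertext

store.purge_user("alice")

try:
    store.get("alice", a_ref)
    alice_readable = True
except KeyError:
    alice_readable = False

print(alice_readable)                     # False
print(store.get("bob", b_ref) == record)  # True
```

The trade-off is visible in the sketch: the two copies no longer deduplicate, and the purge is only as good as your ability to destroy every copy of the key.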

replies(1): >>22764399 #
visarga ◴[] No.22764399[source]
But then wouldn't the storage and distribution of keys become a similar problem to the original one? If the keys get distributed, then it's hard to really remove them.
replies(1): >>22805989 #
jdoliner ◴[] No.22805989[source]
Yes, all the keys do is scale the problem down. In general this is a very tough problem: everything else in the system is designed to avoid data loss, because that's the biggest, scariest failure case. But then when you want to lose data, all the measures in the system that prevent data loss keep that from happening.