Dolt is Git for data

(www.dolthub.com)
358 points by timsehn | 10 comments
sytse ◴[] No.22734084[source]
Very cool! The world needs better version control for data.

How does this compare to something like Pachyderm?

How does it work under the covers? What is a splice and what does it mean when it overlaps? https://github.com/liquidata-inc/dolt/blob/84d9eded517167eb2...

Is it feasible to use Conflict-free Replicated Data Types (CRDT) for this?

replies(2): >>22734407 #>>22735544 #
1. timsehn ◴[] No.22734407[source]
Here is an earlier blog post we published comparing Dolt to Pachyderm: https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-...

We've got a blog post on the storage system coming on Wednesday. The storage layer is a mashup of a Merkle DAG and a B-tree called a Prolly tree. It comes from an open source package called Noms (https://github.com/attic-labs/noms).
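
To make the Prolly tree idea a little more concrete, here is a minimal Go sketch of the content-defined chunking that gives it both its Merkle and its B-tree character. This is an illustration only, not Noms' or Dolt's actual layout: the types, the boundary rule, and the target chunk size are all invented for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// entry is one key/value pair in a sorted map (e.g. a row keyed by primary key).
type entry struct{ key, val string }

// boundary decides, from the hash of a single entry, whether a chunk ends after it.
// Because the decision depends only on content, the same run of rows always chunks
// the same way, so unchanged regions hash identically across versions and dedupe.
func boundary(e entry) bool {
	h := sha256.Sum256([]byte(e.key + "\x00" + e.val))
	// End a chunk when the low 12 bits of the hash are zero (~1 in 4096 entries,
	// so chunks average about 4096 entries); the threshold here is arbitrary.
	return binary.BigEndian.Uint16(h[:2])&0x0fff == 0
}

// chunkLevel splits sorted entries into content-defined chunks and returns one
// hash per chunk; those hashes become the entries of the next level up. Repeating
// this until a single hash remains yields a B-tree-shaped Merkle structure.
func chunkLevel(entries []entry) [][32]byte {
	var hashes [][32]byte
	cur := sha256.New()
	n := 0
	for _, e := range entries {
		cur.Write([]byte(e.key))
		cur.Write([]byte(e.val))
		n++
		if boundary(e) {
			var sum [32]byte
			copy(sum[:], cur.Sum(nil))
			hashes = append(hashes, sum)
			cur, n = sha256.New(), 0
		}
	}
	if n > 0 { // flush the trailing partial chunk
		var sum [32]byte
		copy(sum[:], cur.Sum(nil))
		hashes = append(hashes, sum)
	}
	return hashes
}

func main() {
	rows := []entry{{"1", "alice"}, {"2", "bob"}, {"3", "carol"}}
	for _, h := range chunkLevel(rows) {
		fmt.Printf("%x\n", h)
	}
}
```

Because a chunk boundary depends only on the entries themselves, inserting or updating one row rewrites only the chunk that contains it (plus its ancestors up to the root); every other chunk hashes identically and is shared between versions.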

I'm not familiar with CRDT. Will read up on that.

replies(2): >>22734605 #>>22734652 #
2. jdoliner ◴[] No.22734605[source]
Weighing in as Pachyderm founder.

The post Tim links here is a very apt description of what Pachyderm does. We're designed for version controlling data pipelines, as well as the data they take in and produce. Pachyderm's filesystem, pfs, is the component that's most similar to Dolt. Pfs is a filesystem rather than a database, so it tends to be used for bigger data formats like videos, genomics files, and sometimes database dumps. And the main reason people do that is so they can run pipelines on top of those data files.

Under the hood the data structures are actually very similar, though we use a Merkle tree rather than a DAG; the overall algorithm is much the same. Dolt, I think, is a great approach to version controlling SQL-style data and access. Noms was a really cool idea that never seemed to quite find its groove, whereas Dolt seems to have taken the algorithm and made it into more of a tool with practical uses.

replies(1): >>22735198 #
3. jamesblonde ◴[] No.22734652[source]
What is your take on the need for time-travel queries over versioned, mutable data? Versioning immutable data items is not enough if you have structured data that gets updated. Every time you update a data item you store a full copy, not a diff of the actual data, and you cannot make "time-travel queries" - "give me the data that was generated in this time range", for example.

For example, if you have a feature store for ML, you might want to say "give me train/test data for these features for the years 2012-2020". That isn't possible with versioned immutable data items. Also, if you store immutable copies rather than diffs of the data, you get explosive growth in data volumes. There are two (maybe three) frameworks I am aware of that allow such time-travel queries: Apache Hudi (Uber) and Databricks Delta (Apache Iceberg, by Netflix, will have support soon).

Reference:

https://www.logicalclocks.com/blog/mlops-with-a-feature-stor...

replies(1): >>22734670 #
4. timsehn ◴[] No.22734670[source]
The storage system we use only stores the rows that change. We have a blog post we're publishing on Wednesday explaining how.
replies(1): >>22734971 #
5. jamesblonde ◴[] No.22734971{3}[source]
That's nice. Do you have any idea whether it is possible to translate those rows into higher-level time-travel queries? For example, could you plug in an adapter to transform the rows into a format (Parquet, Arrow, JSON, whatever) that could be useful to analytics and ML apps?
replies(1): >>22735027 #
6. timsehn ◴[] No.22735027{4}[source]
Like "as of" queries or history queries? We have both of those.

AS OF: https://www.dolthub.com/blog/2020-03-20-querying-historical-...

HISTORY SYSTEM TABLE: https://www.dolthub.com/blog/2020-01-23-access-to-everything...

You can run `dolt sql -r csv -q <query>` to output whatever you want to a CSV. We would need to do some work to output a hierarchical format.

I'm sure it's possible to build whatever time travel operation you want. We can produce an audit log of every cell in the database pretty quickly.
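
For a sense of what that looks like in practice, here is a rough Go sketch against a running `dolt sql-server` (Dolt speaks the MySQL wire protocol). The database, table, and column names are made up for illustration, and the exact `AS OF` and `dolt_history_<table>` syntax is covered in the posts linked above, so treat the query strings below as an approximation rather than a reference.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // dolt sql-server speaks the MySQL wire protocol
)

func main() {
	// Hypothetical local dolt sql-server with a database `mydb` and a table `inventory`.
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:3306)/mydb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// "As of" query: the table as it looked at the parent of the current commit.
	var n int
	if err := db.QueryRow("SELECT COUNT(*) FROM inventory AS OF 'HEAD^'").Scan(&n); err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows one commit ago:", n)

	// History query: every past version of one row, plus the commit that wrote it.
	rows, err := db.Query(
		"SELECT commit_hash, commit_date, quantity FROM dolt_history_inventory WHERE name = ?",
		"widget")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var hash, date string
		var qty int
		if err := rows.Scan(&hash, &date, &qty); err != nil {
			log.Fatal(err)
		}
		fmt.Println(hash, date, qty)
	}
}
```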

7. visarga ◴[] No.22735198[source]
How does Pachyderm deal with GDPR requests? Is it possible to remove a file not just from the present but also from the history? It would be no use to delete a file from the current version on a GDPR request while still keeping it around in past commits.
replies(1): >>22741527 #
8. jdoliner ◴[] No.22741527{3}[source]
Requests to purge data are one aspect of the GDPR that Pachyderm makes trickier. It makes it easier to remove a piece of data and recompute all of your models without it, because it can deduplicate the computation. But when it comes to truly purging a piece of data, deduplication becomes a hindrance, because the data can be referenced by previous commits, and even by other users' data. You can delete a piece of data and have it not be truly purged.

The best recommendation we have for that is that each user's data should be encrypted with a key that's unique to the user, and when that user asks you to purge their data you throw away the key. That means that even if two users have the same data it will be stored encrypted under different keys, so if one asks for their data to be purged the other can still keep theirs.
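
For anyone unfamiliar with the pattern, this is sometimes called crypto-shredding. Below is a minimal, illustrative Go sketch of the idea, not Pachyderm's implementation: each user's blobs are sealed under a per-user key, and "purging" is just forgetting the key, so every old, deduplicated copy of the ciphertext becomes unreadable.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// keyStore maps user IDs to their data-encryption keys. Purging a user means
// deleting their key; their ciphertext can remain in every old commit,
// deduplicated or not, and is still unreadable.
type keyStore map[string][]byte

// keyFor returns the user's key, generating a fresh 256-bit key on first use.
func (ks keyStore) keyFor(user string) ([]byte, error) {
	if k, ok := ks[user]; ok {
		return k, nil
	}
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		return nil, err
	}
	ks[user] = k
	return k, nil
}

// seal encrypts plaintext with AES-GCM and prepends the random nonce.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal; it fails once the key has been discarded.
func open(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():], nil)
}

func main() {
	ks := keyStore{}
	key, _ := ks.keyFor("alice")
	blob, _ := seal(key, []byte("alice's row"))

	// GDPR purge: forget the key. Every historical copy of blob is now junk bytes.
	delete(ks, "alice")

	if _, err := open(ks["alice"], blob); err != nil {
		fmt.Println("purged:", err)
	}
}
```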

replies(1): >>22764399 #
9. visarga ◴[] No.22764399{4}[source]
But then wouldn't the storage and distribution of keys become a similar problem to the original one? If the keys get distributed, then it's hard to really remove them.
replies(1): >>22805989 #
10. jdoliner ◴[] No.22805989{5}[source]
Yes, all the keys do is scale the problem down. In general this is a very tough problem: everything else in the system is designed to avoid data loss, since that's the biggest, scariest failure case. But then, when you do want to lose data, all the measures in the system that prevent data loss get in the way.