
Dolt is Git for data

(www.dolthub.com)
358 points | timsehn | 4 comments
sytse ◴[] No.22734084[source]
Very cool! The world needs better version control for data.

How does this compare to something like Pachyderm?

How does it work under the covers? What is a splice and what does it mean when it overlaps? https://github.com/liquidata-inc/dolt/blob/84d9eded517167eb2...

Is it feasible to use Conflict-free Replicated Data Types (CRDT) for this?

replies(2): >>22734407 #>>22735544 #
timsehn ◴[] No.22734407[source]
Here is an earlier blog post we published comparing Dolt to Pachyderm: https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-...

We've got a blog post on the storage system coming on Wednesday. It's a mashup of a Merkle DAG and a B-tree called a Prolly Tree. It comes from an open source package called Noms (https://github.com/attic-labs/noms).
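The Prolly Tree idea can be sketched with content-defined chunking (a simplified illustration of the general technique, not Noms/Dolt's actual format): because chunk boundaries are derived from row content rather than position, two versions of a table that differ by one row share almost all of their chunks, which is what makes diffs and structural sharing cheap.

```python
import hashlib

def chunk_rows(rows, boundary_mod=4):
    """Split a sorted sequence of rows into chunks at content-defined
    boundaries: a row closes a chunk when a hash of its bytes hits a
    target value. Boundaries depend only on row content, so editing one
    row disturbs only the chunk that contains it."""
    chunks, current = [], []
    for row in rows:
        current.append(row)
        if hashlib.sha256(row.encode()).digest()[0] % boundary_mod == 0:
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return chunks

def chunk_hashes(chunks):
    """Content-address each chunk, as a Merkle structure would."""
    return {hashlib.sha256(repr(c).encode()).hexdigest() for c in chunks}

rows_v1 = [f"row-{i:04d}" for i in range(1000)]
rows_v2 = rows_v1[:500] + ["row-0500-edited"] + rows_v1[501:]

v1 = chunk_hashes(chunk_rows(rows_v1))
v2 = chunk_hashes(chunk_rows(rows_v2))
print(f"{len(v1 & v2)} of {len(v1)} chunks shared after a one-row edit")
```

A real Prolly Tree applies this recursively (chunks of chunks) and uses a rolling hash, but the shared-chunk property is the same.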

I'm not familiar with CRDTs. I'll read up on them.

replies(2): >>22734605 #>>22734652 #
1. jamesblonde ◴[] No.22734652[source]
What is your take on the need for time-travel queries over versioned, mutable data? Versioning immutable data items is not enough if you have structured data that is updated. Every time you update a data item, you store a full copy - not a diff of the actual data - and you are not able to make "time-travel queries": give me the data that was generated in this time range, for example.

For example, if you have a Feature Store for ML and you want to say "give me train/test data for these features for the years 2012-2020" - that isn't possible with versioned immutable data items. Also, if you store immutable copies instead of diffs, you get explosive growth in data volume. There are two (maybe three) frameworks I am aware of that allow such time-travel queries: Apache Hudi (from Uber) and Databricks Delta; Apache Iceberg (from Netflix) will have support soon.

Reference:

https://www.logicalclocks.com/blog/mlops-with-a-feature-stor...
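The diff-versus-full-copy distinction can be illustrated with a toy versioned table (a sketch of the general idea, not Hudi's, Delta's, or Dolt's implementation): each commit stores only the changed rows, and a time-travel query reconstructs a past state by replaying diffs up to the requested version.

```python
class DiffVersionedTable:
    """Toy table storing one base snapshot plus per-commit diffs,
    so an update stores only the changed rows, not a full copy."""

    def __init__(self, base):
        self.base = dict(base)   # full copy stored exactly once
        self.diffs = []          # one {key: new_value_or_None} per commit

    def commit(self, changes):
        """Store only the changed rows (None marks a delete)."""
        self.diffs.append(dict(changes))

    def as_of(self, version):
        """Time-travel: replay diffs up to `version` onto the base."""
        state = dict(self.base)
        for diff in self.diffs[:version]:
            for key, value in diff.items():
                if value is None:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state

t = DiffVersionedTable({"u1": 10, "u2": 20})
t.commit({"u1": 11})              # one changed row stored, not a snapshot
t.commit({"u3": 30, "u2": None})  # one insert, one delete
print(t.as_of(0))  # {'u1': 10, 'u2': 20}
print(t.as_of(2))  # {'u1': 11, 'u3': 30}
```

Production systems index the diffs (by commit, timestamp, or partition) so `as_of` doesn't replay history from the start, but the storage-size argument is the same.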

replies(1): >>22734670 #
2. timsehn ◴[] No.22734670[source]
The storage system we use only stores the rows that change. We have a blog post we're publishing on Wednesday explaining how.
replies(1): >>22734971 #
3. jamesblonde ◴[] No.22734971[source]
That's nice. Do you have any idea whether it is possible to translate those rows into higher-level time-travel queries? Like if you could plug in an adapter to transform the rows into a data structure (Parquet, Arrow, JSON, whatever) that could be useful to analytics and ML apps?
replies(1): >>22735027 #
4. timsehn ◴[] No.22735027{3}[source]
Like "as of" queries or history queries? We have both of those.

AS OF: https://www.dolthub.com/blog/2020-03-20-querying-historical-...

HISTORY SYSTEM TABLE: https://www.dolthub.com/blog/2020-01-23-access-to-everything...

You can run `dolt sql -r csv -q <query>` to output query results as CSV. We would need to do work to output a hierarchical format.

I'm sure it's possible to build whatever time-travel operation you want. We can produce an audit log of every cell in the database pretty quickly.
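The cell-level audit log mentioned above can be sketched like so (a hypothetical illustration; `cell_audit_log` is not a Dolt API): walk the commits touching a table and emit one record per cell whose value actually changed.

```python
def cell_audit_log(commits):
    """commits: list of (commit_id, {pk: {column: value}}) giving the
    rows touched by each commit. Returns one (commit_id, pk, column,
    value) record per cell that changed relative to the prior state."""
    previous = {}
    log = []
    for commit_id, rows in commits:
        for pk, columns in rows.items():
            for column, value in columns.items():
                if previous.get(pk, {}).get(column) != value:
                    log.append((commit_id, pk, column, value))
            previous.setdefault(pk, {}).update(columns)
    return log

commits = [
    ("c1", {"u1": {"name": "ada", "score": 1}}),
    ("c2", {"u1": {"name": "ada", "score": 2}}),  # only score changed
]
print(cell_audit_log(commits))
```

Filtering such a log by commit range or timestamp is one way the higher-level time-travel queries discussed upthread could be assembled.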