
Dolt is Git for data

(www.dolthub.com)
358 points | timsehn | 4 comments
sytse ◴[] No.22734084[source]
Very cool! The world needs better version control for data.

How does this compare to something like Pachyderm?

How does it work under the covers? What is a splice and what does it mean when it overlaps? https://github.com/liquidata-inc/dolt/blob/84d9eded517167eb2...

Is it feasible to use Conflict-free Replicated Data Types (CRDT) for this?

replies(2): >>22734407 #>>22735544 #
timsehn ◴[] No.22734407[source]
Here is an earlier blog post we published comparing Dolt to Pachyderm: https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-...

We've got a blog post on the storage system coming on Wednesday. It's a mashup of a Merkle DAG and a B-tree called a Prolly Tree. It comes from an open source package called Noms (https://github.com/attic-labs/noms).
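The Prolly Tree idea can be sketched with content-defined chunking (a simplified illustration of the general technique, not Noms/Dolt's actual format): because chunk boundaries are derived from row content rather than position, two versions of a table that differ by one row share almost all of their chunks, which is what makes diffs and structural sharing cheap.

```python
import hashlib

def chunk_rows(rows, boundary_mod=4):
    """Split a sorted sequence of rows into chunks at content-defined
    boundaries: a row closes a chunk when a hash of its bytes hits a
    target value. Boundaries depend only on row content, so editing one
    row disturbs only the chunk that contains it."""
    chunks, current = [], []
    for row in rows:
        current.append(row)
        if hashlib.sha256(row.encode()).digest()[0] % boundary_mod == 0:
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return chunks

def chunk_hashes(chunks):
    """Content-address each chunk, as a Merkle structure would."""
    return {hashlib.sha256(repr(c).encode()).hexdigest() for c in chunks}

rows_v1 = [f"row-{i:04d}" for i in range(1000)]
rows_v2 = rows_v1[:500] + ["row-0500-edited"] + rows_v1[501:]

v1 = chunk_hashes(chunk_rows(rows_v1))
v2 = chunk_hashes(chunk_rows(rows_v2))
print(f"{len(v1 & v2)} of {len(v1)} chunks shared after a one-row edit")
```

A real Prolly Tree applies this recursively (chunks of chunks) and uses a rolling hash, but the shared-chunk property is the same.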

I'm not familiar with CRDTs. I'll read up on them.

replies(2): >>22734605 #>>22734652 #
1. jamesblonde ◴[] No.22734652[source]
What is your take on the need for time-travel queries over versioned, mutable data? Versioning immutable data items is not enough if you have structured data that is updated. Every time you update a data item, you store a full copy - not a diff of the actual data - and you are not able to make "time-travel queries": give me the data that was generated in this time range, for example.

For example, if you have a Feature Store for ML and you want to say "give me train/test data for these features for the years 2012-2020" - that isn't possible with versioned immutable data items. Also, if you store immutable copies instead of diffs, you get explosive growth in data volume. There are two (maybe three) frameworks I am aware of that allow such time-travel queries: Apache Hudi (from Uber) and Databricks Delta; Apache Iceberg (from Netflix) will have support soon.

Reference:

https://www.logicalclocks.com/blog/mlops-with-a-feature-stor...
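The diff-versus-full-copy distinction can be illustrated with a toy versioned table (a sketch of the general idea, not Hudi's, Delta's, or Dolt's implementation): each commit stores only the changed rows, and a time-travel query reconstructs a past state by replaying diffs up to the requested version.

```python
class DiffVersionedTable:
    """Toy table storing one base snapshot plus per-commit diffs,
    so an update stores only the changed rows, not a full copy."""

    def __init__(self, base):
        self.base = dict(base)   # full copy stored exactly once
        self.diffs = []          # one {key: new_value_or_None} per commit

    def commit(self, changes):
        """Store only the changed rows (None marks a delete)."""
        self.diffs.append(dict(changes))

    def as_of(self, version):
        """Time-travel: replay diffs up to `version` onto the base."""
        state = dict(self.base)
        for diff in self.diffs[:version]:
            for key, value in diff.items():
                if value is None:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state

t = DiffVersionedTable({"u1": 10, "u2": 20})
t.commit({"u1": 11})              # one changed row stored, not a snapshot
t.commit({"u3": 30, "u2": None})  # one insert, one delete
print(t.as_of(0))  # {'u1': 10, 'u2': 20}
print(t.as_of(2))  # {'u1': 11, 'u3': 30}
```

Production systems index the diffs (by commit, timestamp, or partition) so `as_of` doesn't replay history from the start, but the storage-size argument is the same.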

replies(1): >>22734670 #
2. timsehn ◴[] No.22734670[source]
The storage system we use only stores the rows that change. We have a blog post we're publishing on Wednesday explaining how.
replies(1): >>22734971 #
3. jamesblonde ◴[] No.22734971[source]
That's nice. Do you have any idea whether it is possible to translate those rows into higher-level time-travel queries? Like if you could plug in an adapter to transform the rows into a data structure (Parquet, Arrow, JSON, whatever) that could be useful to analytics and ML apps?
replies(1): >>22735027 #
4. timsehn ◴[] No.22735027{3}[source]
Like "as of" queries or history queries? We have both of those.

AS OF: https://www.dolthub.com/blog/2020-03-20-querying-historical-...

HISTORY SYSTEM TABLE: https://www.dolthub.com/blog/2020-01-23-access-to-everything...

You can run `dolt sql -r csv -q <query>` to output query results as CSV. We would need to do work to output a hierarchical format.

I'm sure it's possible to build whatever time-travel operation you want. We can produce an audit log of every cell in the database pretty quickly.
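The cell-level audit log mentioned above can be sketched like so (a hypothetical illustration; `cell_audit_log` is not a Dolt API): walk the commits touching a table and emit one record per cell whose value actually changed.

```python
def cell_audit_log(commits):
    """commits: list of (commit_id, {pk: {column: value}}) giving the
    rows touched by each commit. Returns one (commit_id, pk, column,
    value) record per cell that changed relative to the prior state."""
    previous = {}
    log = []
    for commit_id, rows in commits:
        for pk, columns in rows.items():
            for column, value in columns.items():
                if previous.get(pk, {}).get(column) != value:
                    log.append((commit_id, pk, column, value))
            previous.setdefault(pk, {}).update(columns)
    return log

commits = [
    ("c1", {"u1": {"name": "ada", "score": 1}}),
    ("c2", {"u1": {"name": "ada", "score": 2}}),  # only score changed
]
print(cell_audit_log(commits))
```

Filtering such a log by commit range or timestamp is one way the higher-level time-travel queries discussed upthread could be assembled.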