
334 points gjvc | 4 comments
cosmic_quanta No.31847838
That looks awesome. One of the listed use-cases is 'time-travel': https://dolthub.com/blog/2021-03-09-dolt-use-cases-in-the-wi...

I wish we could use this at work. We're trying to predict time-series stuff, and a lot of our infrastructure complexity exists solely to ensure that when we're training on data from years ago, we're not using data that would have been in the future at that point (future data leaking into the past).

Using Dolt, as far as I understand it, we could simply set the DB to a point in the past where the 'future' data wasn't yet available. Very cool.
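
If I'm reading the Dolt docs right, the server speaks the MySQL wire protocol and lets you query a table AS OF a commit, branch, or timestamp, so a point-in-time training query might look roughly like this (the prices table and connection details are made up):

    import mysql.connector  # Dolt's sql-server speaks the MySQL protocol

    conn = mysql.connector.connect(
        host="127.0.0.1", port=3306, user="root", database="timeseries"
    )
    cur = conn.cursor()

    # Read the table as it existed at the cutoff, so rows added later
    # can't leak into the training window. "prices" is a placeholder.
    cur.execute(
        "SELECT * FROM prices AS OF TIMESTAMP('2019-06-30 23:59:59')"
    )
    train_rows = cur.fetchall()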

replies(5): >>31847959 #>>31848014 #>>31849805 #>>31849874 #>>31859003 #
kortex No.31848014
Have you looked at dvc (www.dvc.org)? It takes a little while to figure out how you want to handle the backing store (usually S3), but after that it's very straightforward. You could use a similar pattern: keep a data repository, move the git HEAD to the desired commit, and dvc adds/removes the data files to match what's in that commit. You can even version binaries without blowing up your .git tree.
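
There's also a small Python API if you'd rather pin a revision in code than move HEAD around by hand; a rough sketch (repo URL, file path, and tag here are just placeholders):

    import dvc.api
    import pandas as pd

    # Read a DVC-tracked file exactly as it was at a given git rev;
    # dvc fetches the matching blob from the backing store (e.g. S3).
    with dvc.api.open(
        "data/train.csv",
        repo="https://github.com/example/data-registry",
        rev="v2019-06-30",  # any git rev: tag, branch, or commit SHA
    ) as f:
        df = pd.read_csv(f)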
replies(2): >>31848910 #>>31849064 #
nerdponx No.31849064
DVC is great for tracking locally-stored data and artifacts generated in the course of a research project, and for sharing those artifacts across a team of collaborators (and/or future users).

However, DVC is fundamentally limited in that dependencies and outputs can only be files on the filesystem. Theoretically they could start supporting pluggable non-file-but-file-like artifacts, but for now that's just a feature request and I don't know if it's on their roadmap at all.

This is fine, of course, but it kind of sucks when your data is "big"-ish and you can't or don't want to keep it on your local machine, e.g. intermediate datasets that live in some kind of "scratch" workspace within your data lake/warehouse. You can use DBT for that in some cases, but that's not really what DBT is for, and then you have two incompatible workflow graphs within your project and a whole other set of CLI touch points and program semantics to learn.

The universal solution is something like Airflow, but it's way too verbose for use during a research project, and running it is way too complicated. It's an industrial-strength data engineering tool, not a research workflow-and-artifact-tracking tool.

I think my ideal tool would be "DVC, but pluggable/extensible with an Airflow-like API."

replies(1): >>31850469 #
1. henrydark No.31850469
I have dvc pipelines whose inputs/outputs are Iceberg snapshot files. The data gets medium-big and it works well.
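
Roughly, dvc only versions the small Iceberg snapshot/metadata files while the table data itself stays in object storage. A stage that consumes a pinned snapshot might look something like this in Python (pyiceberg, with placeholder catalog/table names and snapshot id; the actual wiring goes through dvc.yaml deps/outs on the snapshot files):

    from pyiceberg.catalog import load_catalog

    # Placeholder catalog/table names; configure the catalog however
    # your lake is set up.
    catalog = load_catalog("default")
    table = catalog.load_table("scratch.intermediate_features")

    # Scan at a fixed snapshot id so the stage's input is reproducible
    # even as new snapshots get appended to the table.
    snapshot_id = 4358109269898993000  # placeholder, read from the tracked snapshot file
    batch = table.scan(snapshot_id=snapshot_id).to_arrow()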
replies(1): >>31850507 #
2. nerdponx No.31850507
I'd never heard of Apache Iceberg before. I've used Databricks Delta Lake; is it similar? What is a snapshot file in this case?
replies(1): >>31852048 #
3. henrydark No.31852048
It's basically the same; I just went with Iceberg because the specification is a bit more transparent.
replies(1): >>31853090 #
4. nerdponx No.31853090
Interesting. So the snapshot file acts in much the same way as a manual "sentinel" file? I generally try to avoid such things because they're brittle and it's easy to make a mistake and get the "ad hoc database on your filesystem" out of sync with the actual data.