334 points gjvc | 9 comments
cosmic_quanta ◴[] No.31847838[source]
That looks awesome. One of the listed use-cases is 'time-travel': https://dolthub.com/blog/2021-03-09-dolt-use-cases-in-the-wi...

I wish we could use this at work. We're trying to predict time-series stuff, and a lot of our infrastructure complexity exists just to ensure that when we're training on data from years ago, we're not using data that would have been in the future at that point (future data leaking into the past).

Using Dolt, as far as I understand it, we could simply set the DB to a point in the past where the 'future' data wasn't available. Very cool
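
Rough sketch of what I have in mind, since Dolt serves a MySQL-compatible endpoint (table/connection details and the exact AS OF form are guesses on my part, so check the Dolt docs):

  import pymysql

  # Any MySQL client should work against "dolt sql-server"; host, database,
  # and table names here are invented.
  conn = pymysql.connect(host="127.0.0.1", user="root", database="prices")

  with conn.cursor() as cur:
      # AS OF pins the query to a commit, branch, or timestamp, so only rows
      # that existed at that point in history come back -- no "future" data
      # can leak into the training window.
      cur.execute("SELECT * FROM daily_prices AS OF TIMESTAMP('2019-06-30')")
      rows = cur.fetchall()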

replies(5): >>31847959 #>>31848014 #>>31849805 #>>31849874 #>>31859003 #
1. kortex ◴[] No.31848014[source]
Have you looked at dvc (www.dvc.org)? It takes a little bit to figure out how you want to handle the backing store (usually S3), but then it's very straightforward. You could do a similar pattern: have a data repository and simply move the git HEAD to the desired spot, and dvc automatically adds/removes the data files based on what's in the commit. You can even version binaries without blowing up your .git tree.
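
For example, DVC's Python API can read a tracked file as of a given git revision without even touching your working tree (repo URL, path, and rev below are placeholders):

  import dvc.api

  # Reads the file as it existed at that git revision, fetching the matching
  # blob from the DVC remote (e.g. S3) if it isn't already cached locally.
  with dvc.api.open(
      "data/train.csv",                                  # placeholder path
      repo="https://github.com/your-org/data-registry",  # placeholder repo
      rev="2020-12-31",                                  # tag/branch/commit
  ) as f:
      first_line = f.readline()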
replies(2): >>31848910 #>>31849064 #
2. isolli ◴[] No.31848910[source]
I'm looking into DVC right now, and I feel like the code history (in git) and the data history are too intertwined. If you move the git HEAD back, then you get the old data back, but you also get the old code back. I wish there was a way to move the two "heads" independently. Or is there?

Edit: I can always revert the contents of the .dvc folder to a previous commit, but I wonder if there's a more natural way of doing it.
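
Something like this is the manual version of what I mean (paths and revs made up): restore only the .dvc pointer files from an older commit, then have dvc sync the workspace to them.

  import subprocess

  DATA_REV = "a1b2c3d"  # hypothetical commit that still has the old data

  # Take only the DVC pointer file(s) from that commit; the code stays at HEAD.
  subprocess.run(
      ["git", "checkout", DATA_REV, "--", "data/train.csv.dvc"], check=True
  )

  # Rebuild the data files in the workspace to match the restored pointer(s).
  subprocess.run(["dvc", "checkout", "data/train.csv.dvc"], check=True)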

replies(2): >>31849719 #>>31854496 #
3. nerdponx ◴[] No.31849064[source]
DVC is great for tracking locally-stored data and artifacts generated in the course of a research project, and for sharing those artifacts across a team of collaborators (and/or future users).

However, DVC is fundamentally limited because you can only have dependencies and outputs that are files on the filesystem. Theoretically they could start supporting pluggable non-file-but-file-like artifacts, but for now it's just a feature request and I don't know if it's on their roadmap at all.
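
To be concrete, every stage boils down to paths on disk, roughly like this (stage and file names invented):

  import subprocess

  # Every dependency (-d) and output (-o) of a DVC stage has to be a path on
  # the filesystem; there's no notion of, say, a database table as a dep/out.
  subprocess.run(
      [
          "dvc", "stage", "add", "-n", "train",
          "-d", "train.py", "-d", "data/train.csv",
          "-o", "models/model.pkl",
          "python", "train.py",
      ],
      check=True,
  )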

This is fine, of course, but it kind of sucks when your data is "big"-ish and you can't or don't want to keep it on your local machine, e.g. generating intermediate datasets that live in some kind of "scratch" workspace within your data lake/warehouse. You can use DBT for that in some cases, but that's not really what DBT is for, and then you have two incompatible workflow graphs within your project and a whole other set of CLI touch points and program semantics to learn.

The universal solution is something like Airflow, but it's way too verbose for use during a research project, and running it is way too complicated. It's an industrial-strength data engineering tool, not a research workflow-and-artifact-tracking tool.

I think my ideal tool would be "DVC, but pluggable/extensible with an Airflow-like API."

replies(1): >>31850469 #
4. george_ciobanu ◴[] No.31849719[source]
also check out Datomic.
5. henrydark ◴[] No.31850469[source]
I have dvc pipelines whose inputs/outputs are Iceberg snapshot files. The data gets medium-big and it works well.
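
I won't claim this is exactly how you'd wire it into dvc, but the idea is that a pinned snapshot id fully determines what a stage reads or wrote; a rough sketch with pyiceberg (catalog/table names and the snapshot id are invented):

  from pyiceberg.catalog import load_catalog

  # Assumes a catalog named "default" is configured; names/ids are invented.
  catalog = load_catalog("default")
  table = catalog.load_table("warehouse.daily_prices")

  # Scanning at a pinned snapshot id always returns the same data, which is
  # what makes it a stable, versionable input/output for a pipeline stage.
  df = table.scan(snapshot_id=1234567890).to_pandas()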
replies(1): >>31850507 #
6. nerdponx ◴[] No.31850507{3}[source]
I'd never heard of Apache Iceberg before. I've used Databricks Delta Lake; is it similar? What is a snapshot file in this case?
replies(1): >>31852048 #
7. henrydark ◴[] No.31852048{4}[source]
It's basically the same; I just went with Iceberg because the specification is a bit more transparent.
replies(1): >>31853090 #
8. nerdponx ◴[] No.31853090{5}[source]
Interesting. So the snapshot file acts in much the same way as a manual "sentinel" file? I generally try to avoid such things because they're brittle and it's easy to make a mistake and get the "ad hoc database on your filesystem" out of sync with the actual data.
9. arjvik ◴[] No.31854496[source]
If you want the dataset to be independent, I would recommend having a separate repository for the dataset and using Git submodules to pull it in. That way you can check out different versions of the dataset and code, because they are essentially in separate working trees.
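
Rough sketch of that workflow (URLs, paths, and versions are placeholders):

  import subprocess

  def run(*cmd, cwd=None):
      subprocess.run(cmd, cwd=cwd, check=True)

  # One-time setup: pull the dataset repo in as a submodule under ./data.
  run("git", "submodule", "add", "https://github.com/your-org/dataset.git", "data")

  # Later: move only the dataset to another version. The outer (code) repo's
  # HEAD doesn't move; it just sees an updated submodule pointer, which you
  # commit whenever you want to pin that combination of code + data.
  run("git", "checkout", "v2.1", cwd="data")
  run("dvc", "checkout", cwd="data")  # only if the dataset repo itself uses DVC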