334 points gjvc | 9 comments
cosmic_quanta ◴[] No.31847838[source]
That looks awesome. One of the listed use-cases is 'time-travel': https://dolthub.com/blog/2021-03-09-dolt-use-cases-in-the-wi...

I wish we could use this at work. We're trying to predict time-series stuff, and a lot of our infrastructure complexity exists just to ensure that when we're training on data from years ago, we're not using data that would have been in the future at that point (future data leaking into the past).

Using Dolt, as far as I understand it, we could simply set the DB to a point in the past where the 'future' data wasn't available. Very cool
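
Rough sketch of what I have in mind, since Dolt serves a MySQL-compatible endpoint (table/connection details and the exact AS OF form are guesses on my part, so check the Dolt docs):

  import pymysql

  # Any MySQL client should work against "dolt sql-server"; host, database,
  # and table names here are invented.
  conn = pymysql.connect(host="127.0.0.1", user="root", database="prices")

  with conn.cursor() as cur:
      # AS OF pins the query to a commit, branch, or timestamp, so only rows
      # that existed at that point in history come back -- no "future" data
      # can leak into the training window.
      cur.execute("SELECT * FROM daily_prices AS OF TIMESTAMP('2019-06-30')")
      rows = cur.fetchall()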

replies(5): >>31847959 #>>31848014 #>>31849805 #>>31849874 #>>31859003 #
1. kortex ◴[] No.31848014[source]
Have you looked at dvc (www.dvc.org)? It takes a little bit to figure out how you want to handle the backing store (usually S3), but then it's very straightforward. You could do a similar pattern: have a data repository and simply move the git HEAD to the desired spot, and dvc automatically adds/removes the data files based on what's in the commit. You can even version binaries without blowing up your .git tree.
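
For example, DVC's Python API can read a tracked file as of a given git revision without even touching your working tree (repo URL, path, and rev below are placeholders):

  import dvc.api

  # Reads the file as it existed at that git revision, fetching the matching
  # blob from the DVC remote (e.g. S3) if it isn't already cached locally.
  with dvc.api.open(
      "data/train.csv",                                  # placeholder path
      repo="https://github.com/your-org/data-registry",  # placeholder repo
      rev="2020-12-31",                                  # tag/branch/commit
  ) as f:
      first_line = f.readline()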
replies(2): >>31848910 #>>31849064 #
2. isolli ◴[] No.31848910[source]
I'm looking into DVC right now, and I feel like the code history (in git) and the data history are too intertwined. If you move the git HEAD back, then you get the old data back, but you also get the old code back. I wish there was a way to move the two "heads" independently. Or is there?

Edit: I can always revert the contents of the .dvc folder to a previous commit, but I wonder if there's a more natural way of doing it.
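
Something like this is the manual version of what I mean (paths and revs made up): restore only the .dvc pointer files from an older commit, then have dvc sync the workspace to them.

  import subprocess

  DATA_REV = "a1b2c3d"  # hypothetical commit that still has the old data

  # Take only the DVC pointer file(s) from that commit; the code stays at HEAD.
  subprocess.run(
      ["git", "checkout", DATA_REV, "--", "data/train.csv.dvc"], check=True
  )

  # Rebuild the data files in the workspace to match the restored pointer(s).
  subprocess.run(["dvc", "checkout", "data/train.csv.dvc"], check=True)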

replies(2): >>31849719 #>>31854496 #
3. nerdponx ◴[] No.31849064[source]
DVC is great for tracking locally-stored data and artifacts generated in the course of a research project, and for sharing those artifacts across a team of collaborators (and/or future users).

However, DVC is fundamentally limited because you can only have dependencies and outputs that are files on the filesystem. Theoretically they could start supporting pluggable non-file-but-file-like artifacts, but for now it's just a feature request and I don't know if it's on their roadmap at all.
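
To be concrete, every stage boils down to paths on disk, roughly like this (stage and file names invented):

  import subprocess

  # Every dependency (-d) and output (-o) of a DVC stage has to be a path on
  # the filesystem; there's no notion of, say, a database table as a dep/out.
  subprocess.run(
      [
          "dvc", "stage", "add", "-n", "train",
          "-d", "train.py", "-d", "data/train.csv",
          "-o", "models/model.pkl",
          "python", "train.py",
      ],
      check=True,
  )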

This is fine, of course, but it kind of sucks when your data is "big"-ish and you can't or don't want to keep it on your local machine, e.g. generating intermediate datasets that live in some kind of "scratch" workspace within your data lake/warehouse. You can use DBT for that in some cases, but that's not really what DBT is for, and then you have two incompatible workflow graphs within your project and a whole other set of CLI touch points and program semantics to learn.

The universal solution is something like Airflow, but it's way too verbose for use during a research project, and running it is way too complicated. It's an industrial-strength data engineering tool, not a research workflow-and-artifact-tracking tool.

I think my ideal tool would be "DVC, but pluggable/extensible with an Airflow-like API."

replies(1): >>31850469 #
4. george_ciobanu ◴[] No.31849719[source]
also check out Datomic.
5. henrydark ◴[] No.31850469[source]
I have dvc pipelines whose inputs/outputs are Iceberg snapshot files. The data gets medium-big and it works well.
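
I won't claim this is exactly how you'd wire it into dvc, but the idea is that a pinned snapshot id fully determines what a stage reads or wrote; a rough sketch with pyiceberg (catalog/table names and the snapshot id are invented):

  from pyiceberg.catalog import load_catalog

  # Assumes a catalog named "default" is configured; names/ids are invented.
  catalog = load_catalog("default")
  table = catalog.load_table("warehouse.daily_prices")

  # Scanning at a pinned snapshot id always returns the same data, which is
  # what makes it a stable, versionable input/output for a pipeline stage.
  df = table.scan(snapshot_id=1234567890).to_pandas()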
replies(1): >>31850507 #
6. nerdponx ◴[] No.31850507{3}[source]
I'd never heard of Apache Iceberg before. I've used Databricks Delta Lake; is it similar? What is a snapshot file in this case?
replies(1): >>31852048 #
7. henrydark ◴[] No.31852048{4}[source]
It's basically the same; I just went with Iceberg because the specification is a bit more transparent.
replies(1): >>31853090 #
8. nerdponx ◴[] No.31853090{5}[source]
Interesting. So the snapshot file acts in much the same way as a manual "sentinel" file? I generally try to avoid such things because they're brittle and it's easy to make a mistake and get the "ad hoc database on your filesystem" out of sync with the actual data.
9. arjvik ◴[] No.31854496[source]
If you want the dataset to be independent, I would recommend having a separate repository for the dataset and using Git submodules to pull it in. That way you can check out different versions of the dataset and code, because they are essentially in separate working trees.
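
Rough sketch of that workflow (URLs, paths, and versions are placeholders):

  import subprocess

  def run(*cmd, cwd=None):
      subprocess.run(cmd, cwd=cwd, check=True)

  # One-time setup: pull the dataset repo in as a submodule under ./data.
  run("git", "submodule", "add", "https://github.com/your-org/dataset.git", "data")

  # Later: move only the dataset to another version. The outer (code) repo's
  # HEAD doesn't move; it just sees an updated submodule pointer, which you
  # commit whenever you want to pin that combination of code + data.
  run("git", "checkout", "v2.1", cwd="data")
  run("dvc", "checkout", cwd="data")  # only if the dataset repo itself uses DVC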