334 points by gjvc | 15 comments
1. cosmic_quanta ◴[] No.31847838[source]
That looks awesome. One of the listed use-cases is 'time-travel': https://dolthub.com/blog/2021-03-09-dolt-use-cases-in-the-wi...

I wish we could use this at work. We're trying to predict time-series stuff. However, there's a lot of infrastructure complexity dedicated to ensuring that when we train on data from years ago, we're not using data that lies in the future relative to that point (future data leaking into the past).

Using Dolt, as far as I understand it, we could simply set the DB to a point in the past where the 'future' data wasn't available. Very cool.
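Roughly what I imagine (a sketch only; the table, refs, and connection details are made up, and Dolt's AS OF clause takes a commit hash or ref, and as far as I can tell timestamps too, so double-check the exact syntax against the docs):

```python
# Sketch: Dolt runs a MySQL-compatible server (`dolt sql-server`), so any
# MySQL client works. Database "timeseries" and table "prices" are invented.
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", port=3306,
                               user="root", database="timeseries")
cur = conn.cursor()

# Read the table as it existed at an older commit/ref, so rows ingested
# later can never leak into the training window.
cur.execute("SELECT * FROM prices AS OF 'HEAD~100' WHERE observed_at < %s",
            ("2020-01-01",))
training_rows = cur.fetchall()
```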

replies(5): >>31847959 #>>31848014 #>>31849805 #>>31849874 #>>31859003 #
2. wodenokoto ◴[] No.31847959[source]
For at least your use-case, this is a fairly common feature in enterprise databases.

For a lot of types of data, auditing requires history.

Have a look at MSSQL's history tables (system-versioned temporal tables), for example: https://docs.microsoft.com/en-us/sql/relational-databases/ta...
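For example, a system-versioned temporal table plus a point-in-time query looks roughly like this (sketch only; the table, columns, and connection string are made up):

```python
# Sketch: SQL Server system-versioned temporal table queried via pyodbc.
# Table/column names and the connection string are illustrative only.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=localhost;DATABASE=research;Trusted_Connection=yes;")
cur = conn.cursor()

cur.execute("""
CREATE TABLE dbo.Prices (
    Id        INT PRIMARY KEY,
    Price     DECIMAL(18, 4) NOT NULL,
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PricesHistory));
""")
conn.commit()

# Read the table exactly as it looked at a past instant; SQL Server pulls
# older row versions from dbo.PricesHistory as needed.
cur.execute("SELECT Id, Price FROM dbo.Prices FOR SYSTEM_TIME AS OF '2020-01-01';")
rows = cur.fetchall()
```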

3. kortex ◴[] No.31848014[source]
Have you looked at DVC (www.dvc.org)? It takes a little bit to figure out how you want to handle the backing store (usually S3), but then it's very straightforward. You could do a similar pattern: have a data repository and simply move the git HEAD to the desired spot, and DVC automatically adds/removes the data files based on what's in the commit. You can even version binaries without blowing up your .git tree.
replies(2): >>31848910 #>>31849064 #
4. isolli ◴[] No.31848910[source]
I'm looking into DVC right now, and I feel like the code history (in git) and the data history are too intertwined. If you move the git HEAD back, then you get the old data back, but you also get the old code back. I wish there was a way to move the two "heads" independently. Or is there?

Edit: I can always revert the contents of the .dvc folder to a previous commit, but I wonder if there's a more natural way of doing it.
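Something like dvc.api.open with a rev argument might be the more natural way, since it reads data pinned to any git revision without moving the code checkout (rough sketch; the path, repo URL, and tag are made up):

```python
# Sketch: read a DVC-tracked file at a specific git revision while the
# checked-out code stays put. Path, repo URL, and tag are hypothetical.
import pandas as pd
import dvc.api

with dvc.api.open("data/train.csv",
                  repo="https://github.com/example/data-registry",
                  rev="v1.2.0") as f:   # any commit, branch, or tag
    train = pd.read_csv(f)
```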

replies(2): >>31849719 #>>31854496 #
5. nerdponx ◴[] No.31849064[source]
DVC is great for tracking locally-stored data and artifacts generated in the course of a research project, and for sharing those artifacts across a team of collaborators (and/or future users).

However DVC is fundamentally limited because you can only have dependencies and outputs that are files on the filesystem. Theoretically they could start supporting pluggable non-file-but-file-like artifacts, but for now it's just a feature request and I don't know if it's on their roadmap at all.

This is fine, of course, but it kind of sucks when your data is "big"-ish and you can't or don't want to keep it on your local machine, e.g. when generating intermediate datasets that live in some kind of "scratch" workspace within your data lake/warehouse. You can use DBT for that in some cases, but that's not really what DBT is for, and then you have two incompatible workflow graphs within your project and a whole other set of CLI touch points and program semantics to learn.

The universal solution is something like Airflow, but it's way too verbose for use during a research project, and running it is way too complicated. It's an industrial-strength data engineering tool, not a research workflow-and-artifact-tracking tool.

I think my ideal tool would be "DVC, but pluggable/extensible with an Airflow-like API."

replies(1): >>31850469 #
6. george_ciobanu ◴[] No.31849719{3}[source]
also check out Datomic.
7. yread ◴[] No.31849805[source]
We use DataVault for that. And perhaps Databricks at some point in the future
8. lichtenberger ◴[] No.31849874[source]
Basically, my research project[1], which I'm working on in my spare time, is all about versioning, efficiently storing small-sized revisions of the data, and allowing sophisticated time-travel queries for audits and analysis.

Of course all secondary user-defined, typed indexes are also versioned.

Basically, the technical idea is to map a huge tree of index tries (with revisions as indexed leaf pages at the top level, and a document index as well as secondary indexes on the second level) to an append-only file. To reduce write amplification and the size of each snapshot, data pages are first compressed and second versioned through a sliding-snapshot algorithm. Thus, Sirix does not simply do a copy-on-write per page. Instead, it writes the nodes which have been changed in the current revision plus the nodes which fall out of the sliding window (therefore it needs a fast random-read drive).
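As a toy illustration of the sliding-snapshot idea (deliberately simplified, not Sirix's actual code): each revision persists the records changed now plus any unchanged record whose last persisted copy is about to fall out of a fixed window, so the latest page state can be rebuilt from fragments no older than the window.

```python
# Toy sketch of sliding-snapshot page versioning (not SirixDB's code).
WINDOW = 4  # how many revisions a record may go without being re-copied

class VersionedPage:
    def __init__(self):
        self.fragments = []   # append-only list of (revision, {slot: value})
        self.last_copy = {}   # slot -> revision of its most recent copy

    def commit(self, revision, changed):
        fragment = dict(changed)
        previous = self.read(revision - 1)
        for slot, rev in self.last_copy.items():
            if slot not in fragment and revision - rev >= WINDOW:
                fragment[slot] = previous[slot]   # re-copy before it falls out
        self.fragments.append((revision, fragment))
        for slot in fragment:
            self.last_copy[slot] = revision

    def read(self, revision):
        # Rebuild the page as of `revision`, newest fragment first. A real
        # store would stop once the window is covered instead of scanning all.
        page = {}
        for rev, fragment in reversed(self.fragments):
            if rev <= revision:
                for slot, value in fragment.items():
                    page.setdefault(slot, value)
        return page

page = VersionedPage()
page.commit(1, {"a": "v1", "b": "v1"})
page.commit(2, {"a": "v2"})
page.commit(5, {"c": "v5"})                    # "b" re-copied: 5 - 1 >= WINDOW
assert page.read(2) == {"a": "v2", "b": "v1"}  # time travel to revision 2
```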

[1] https://github.com/sirixdb/sirix

replies(1): >>31853257 #
9. henrydark ◴[] No.31850469{3}[source]
I have DVC pipelines whose inputs/outputs are Iceberg snapshot files. The data gets medium-big and it works well.
replies(1): >>31850507 #
10. nerdponx ◴[] No.31850507{4}[source]
I never heard of Apache Iceberg before. I've used Databricks Delta Lake; is it similar? What is a snapshot file in this case?
replies(1): >>31852048 #
11. henrydark ◴[] No.31852048{5}[source]
It's basically the same; I just went with Iceberg because the specification is a bit more transparent.
replies(1): >>31853090 #
12. nerdponx ◴[] No.31853090{6}[source]
Interesting. So the snapshot file acts in much the same way as a manual "sentinel" file? I generally try to avoid such things because they are brittle and it's easy to make a mistake and get the "ad hoc database on your filesystem" out of sync with the actual data.
13. awmarthur ◴[] No.31853257[source]
That sounds somewhat similar to Dolt's storage index structure: Prolly Trees https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...
14. arjvik ◴[] No.31854496{3}[source]
If you want the dataset to be independent, I would recommend having a separate repository for the dataset and using Git submodules to pull it in. That way you can check out different versions of the dataset and the code, because they are essentially in separate working trees.
15. Cthulhu_ ◴[] No.31859003[source]
I've (partially, as a POC) implemented time travel in a SQLite database; the TL;DR is that whenever you create a table, you add a second, identical or nearly-identical table with a `_history` suffix; the history table has valid-from and valid-to columns. Then you add a trigger on the primary table that, on update or delete, copies the old values into the history table, setting the 'valid_to' column to the current timestamp.

The reason I used a separate table is so that you don't have to compromise or complicate the primary table's constraints, indices and foreign keys; the history table doesn't really need those because it's not responsible for data integrity.

Anyway, once you have that, you can run queries with a `where $timestamp between valid_from and valid_to` condition, which also allows you to join many tables at a certain point in time. To also be able to get the latest version, you can use a union query (iirc).
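A rough sketch of the scheme with Python's built-in sqlite3 (names invented; the actual POC differs in the details):

```python
# Sketch of the history-table-plus-trigger idea. valid_from is carried in an
# updated_at column on the primary table that the application maintains.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

-- Near-identical copy, minus constraints and indices: it is not responsible
-- for data integrity, only for holding superseded versions.
CREATE TABLE person_history (
    id         INTEGER,
    name       TEXT,
    valid_from TEXT,
    valid_to   TEXT
);

-- On update or delete, copy the old values into the history table and close
-- their validity interval at the current timestamp.
CREATE TRIGGER person_au AFTER UPDATE ON person BEGIN
    INSERT INTO person_history (id, name, valid_from, valid_to)
    VALUES (OLD.id, OLD.name, OLD.updated_at, datetime('now'));
END;

CREATE TRIGGER person_ad AFTER DELETE ON person BEGIN
    INSERT INTO person_history (id, name, valid_from, valid_to)
    VALUES (OLD.id, OLD.name, OLD.updated_at, datetime('now'));
END;
""")

conn.execute("INSERT INTO person (id, name) VALUES (1, 'Alice')")
conn.execute("UPDATE person SET name = 'Alicia', updated_at = datetime('now') WHERE id = 1")

# Point-in-time read: current rows that already existed at :ts, unioned with
# historical rows whose [valid_from, valid_to) interval contains :ts.
AS_OF = """
SELECT id, name FROM person         WHERE updated_at <= :ts
UNION ALL
SELECT id, name FROM person_history WHERE valid_from <= :ts AND :ts < valid_to
"""
now = conn.execute("SELECT datetime('now')").fetchone()[0]
print(conn.execute(AS_OF, {"ts": now}).fetchall())   # -> [(1, 'Alicia')]
```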

I'm sure there are a lot of caveats there, though. What I should do is take some time over the weekend to work on that POC and publish it for posterity / blog-post fuel.