
Dolt is Git for data

(www.dolthub.com)
358 points by timsehn | 2 comments
1. flashman ◴[] No.22734676[source]
So, we ingest a third-party dataset that changes daily. One of our problems is that we need to retrospectively measure arbitrary metrics (how many X had condition Y on days 1 through 180 of the current year?). Imagine the external data looks like this:

UUID,CategoryA,CategoryACount,CategoryB,CategoryBCount,BooleanC,BooleanD...etc

When we ingest a new UUID, we add a column "START_DATE", which is the first date the UUID's metrics were valid. When any of the metric counts changes, we add an "END_DATE" to that row and insert a new row for the UUID with an updated START_DATE.
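
Concretely, the ingest step looks something like this (a minimal sketch in Python/SQLite; the "metrics" table and the reduced set of columns are illustrative, not our actual schema):

    import sqlite3

    # Assumed schema: metrics(UUID, CategoryACount, CategoryBCount, START_DATE, END_DATE)
    def ingest(conn: sqlite3.Connection, rows: list, today: str) -> None:
        cur = conn.cursor()
        for row in rows:
            cur.execute(
                "SELECT CategoryACount, CategoryBCount FROM metrics "
                "WHERE UUID = ? AND END_DATE IS NULL",
                (row["UUID"],),
            )
            current = cur.fetchone()
            changed = current is not None and current != (
                row["CategoryACount"], row["CategoryBCount"]
            )
            if changed:
                # Counts changed: close out the currently open row as of today.
                cur.execute(
                    "UPDATE metrics SET END_DATE = ? "
                    "WHERE UUID = ? AND END_DATE IS NULL",
                    (today, row["UUID"]),
                )
            if current is None or changed:
                # New UUID, or a changed one: open a fresh row.
                cur.execute(
                    "INSERT INTO metrics "
                    "(UUID, CategoryACount, CategoryBCount, START_DATE) "
                    "VALUES (?, ?, ?, ?)",
                    (row["UUID"], row["CategoryACount"],
                     row["CategoryBCount"], today),
                )
        conn.commit()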

It works, but it sucks to analyse because you have to partition the database by the days each row was valid and do your aggregations on those partitions. And it sucks to get a snapshot of how a dataset looked on a particular day. It would be much easier if we could just access the daily diffs, which seems like a task Dolt would accomplish.
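
For reference, the "snapshot on day D" query under this scheme selects the rows whose validity interval covers D (same illustrative schema as the sketch above):

    def snapshot(conn: sqlite3.Connection, day: str):
        # A row is valid on `day` if it was opened on or before that day
        # and is either still open or was closed strictly after it
        # (half-open interval [START_DATE, END_DATE)).
        return conn.execute(
            "SELECT * FROM metrics "
            "WHERE START_DATE <= ? AND (END_DATE IS NULL OR END_DATE > ?)",
            (day, day),
        ).fetchall()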

I mean it has a better chance of working than getting the third party to implement versioning on their data feed.

replies(1): >>22734778 #
2. jamesblonde ◴[] No.22734778[source]
You can accomplish this using time-travel queries in frameworks like Apache Hudi and Databricks Delta, which I mentioned in more detail in an earlier comment. They only work with Spark-based data pipelines, though.
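
For example, Delta Lake exposes time travel as a read option in PySpark, so the daily-snapshot problem above becomes a one-liner (a minimal sketch; the table path and timestamp are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the table exactly as it looked at a past timestamp;
    # .option("versionAsOf", n) pins a specific commit instead.
    snapshot = (
        spark.read.format("delta")
        .option("timestampAsOf", "2020-01-01")
        .load("/data/metrics")  # placeholder path
    )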