
Dolt is Git for data

(www.dolthub.com)
358 points by timsehn | 2 comments
1. flashman ◴[] No.22734676[source]
So, we ingest a third-party dataset that changes daily. One of our problems is that we need to retrospectively measure arbitrary metrics (how many X had condition Y on days 1 through 180 of the current year?). Imagine the external data looks like this:

UUID,CategoryA,CategoryACount,CategoryB,CategoryBCount,BooleanC,BooleanD...etc

When we ingest a new UUID, we add a column "START_DATE", which is the first date the UUID's metrics were valid. When any of the metric counts changes, we add an "END_DATE" to that row and insert a new row for the UUID with an updated START_DATE.
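
Concretely, the ingest step looks something like this (a minimal sketch in Python/SQLite; the "metrics" table and the reduced set of columns are illustrative, not our actual schema):

    import sqlite3

    # Assumed schema: metrics(UUID, CategoryACount, CategoryBCount, START_DATE, END_DATE)
    def ingest(conn: sqlite3.Connection, rows: list, today: str) -> None:
        cur = conn.cursor()
        for row in rows:
            cur.execute(
                "SELECT CategoryACount, CategoryBCount FROM metrics "
                "WHERE UUID = ? AND END_DATE IS NULL",
                (row["UUID"],),
            )
            current = cur.fetchone()
            changed = current is not None and current != (
                row["CategoryACount"], row["CategoryBCount"]
            )
            if changed:
                # Counts changed: close out the currently open row as of today.
                cur.execute(
                    "UPDATE metrics SET END_DATE = ? "
                    "WHERE UUID = ? AND END_DATE IS NULL",
                    (today, row["UUID"]),
                )
            if current is None or changed:
                # New UUID, or a changed one: open a fresh row.
                cur.execute(
                    "INSERT INTO metrics "
                    "(UUID, CategoryACount, CategoryBCount, START_DATE) "
                    "VALUES (?, ?, ?, ?)",
                    (row["UUID"], row["CategoryACount"],
                     row["CategoryBCount"], today),
                )
        conn.commit()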

It works, but it sucks to analyse because you have to partition the database by the days each row was valid and do your aggregations on those partitions. And it sucks to get a snapshot of how a dataset looked on a particular day. It would be much easier if we could just access the daily diffs, which seems like a task Dolt would accomplish.
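
For reference, the "snapshot on day D" query under this scheme selects the rows whose validity interval covers D (same illustrative schema as the sketch above):

    def snapshot(conn: sqlite3.Connection, day: str):
        # A row is valid on `day` if it was opened on or before that day
        # and is either still open or was closed strictly after it
        # (half-open interval [START_DATE, END_DATE)).
        return conn.execute(
            "SELECT * FROM metrics "
            "WHERE START_DATE <= ? AND (END_DATE IS NULL OR END_DATE > ?)",
            (day, day),
        ).fetchall()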

I mean it has a better chance of working than getting the third party to implement versioning on their data feed.

replies(1): >>22734778 #
2. jamesblonde ◴[] No.22734778[source]
You can accomplish this using time-travel queries in frameworks like Apache Hudi and Databricks Delta, which I mentioned in more detail in an earlier comment. They only work with Spark-based data pipelines, though.
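
For example, Delta Lake exposes time travel as a read option in PySpark, so the daily-snapshot problem above becomes a one-liner (a minimal sketch; the table path and timestamp are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the table exactly as it looked at a past timestamp;
    # .option("versionAsOf", n) pins a specific commit instead.
    snapshot = (
        spark.read.format("delta")
        .option("timestampAsOf", "2020-01-01")
        .load("/data/metrics")  # placeholder path
    )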