
752 points by crazypython | 1 comment
scottmcdot No.26371504
Dolt might be good, but never underestimate the power of Type 2 Slowly Changing Dimension tables [1]. For example, if you had an SSIS package that took CSV files and imported them into a database, and one day you noticed it had accidentally rounded a value incorrectly, you could fix the data while retaining traceability of the data that was there originally.

E.g., SSIS package writes row of data: https://imgur.com/DClXAi5

Then a few months later (on 2020-08-15) we identify that trans_value was imported incorrectly so we update it: https://imgur.com/wdQJWm4

Then whenever we SELECT from the table we always ensure we are extracting "today's" version of the data:

    SELECT * FROM table WHERE CURRENT_DATE BETWEEN effective_from AND effective_to
[1] https://en.wikipedia.org/wiki/Slowly_changing_dimension
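A minimal sketch of the Type 2 correction described above, using sqlite3 (the table and column names, the dates, and the corrected value are illustrative, not from the actual SSIS package):

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trans_scd2 (
        trans_id       INTEGER,
        trans_value    REAL,
        effective_from TEXT,
        effective_to   TEXT
    )
""")

# Original row written by the import job, with an incorrectly rounded value.
# '9999-12-31' marks the currently effective row.
conn.execute(
    "INSERT INTO trans_scd2 VALUES (1, 10.0, '2020-01-01', '9999-12-31')"
)

# 2020-08-15: close out the bad row and insert the corrected value.
# The original row is preserved for traceability.
fix_date = "2020-08-15"
conn.execute(
    "UPDATE trans_scd2 SET effective_to = ? "
    "WHERE trans_id = 1 AND effective_to = '9999-12-31'",
    (fix_date,),
)
conn.execute(
    "INSERT INTO trans_scd2 VALUES (1, 10.37, ?, '9999-12-31')",
    (fix_date,),
)

# Always select "today's" version of the data.
# ISO date strings compare correctly as text.
today = date.today().isoformat()
current = conn.execute(
    "SELECT trans_value FROM trans_scd2 "
    "WHERE ? BETWEEN effective_from AND effective_to",
    (today,),
).fetchall()
print(current)  # -> [(10.37,)]: only the corrected row is effective today
```

Both versions remain in the table; only the date-bounded query decides which one a reader sees.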
skybrian No.26371976
The commit log in Dolt is edit history. (When did someone import or edit the data? Who made the change?) It's not about when things happened.

To keep track of when things happened, you would still need date columns. But at least you don't need to handle two-dimensional history for auditing purposes. So, in your example, I think the "effective" date columns wouldn't be needed.

They have ways to query how the dataset appeared at some time in the past. However, with both data changes and schema changes being mixed together in a single commit log, I could see this being troublesome.
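For reference, Dolt exposes this through an `AS OF` clause in SQL (sketch only; the table name and commit reference here are made up):

```sql
-- Query the table as it appeared at an earlier commit
SELECT * FROM trans AS OF 'HEAD~3';
```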

I suppose writing code to convert old edit history to the new schema would still be possible, similar to how git allows you to create a new branch by rewriting an existing one.