Data Version Control | slacker news

head is a bit discombobulated today, but i’ll give this a shot

when i say ‘blob’ data, a good example to think of is a set of really long 1080p video files.

tl;dr version

* throw data into dvc when it’s unstructured ‘blob’ data.

* throw it into iceberg when you’ve got structured data.

benefits of dvc over iceberg:

* not forcing ‘blob’ data into a tabular format and all the “fun” (read: annoying) processing steps that come with doing that

* don’t have to run some processing step to extract ‘blob’ data out of what is basically a parquet file; dvc pull will just download each file as is.

* edit files locally, then run three-ish commands (dvc add, git commit, dvc push) to commit changes, without needing to run a data ingestion pipeline to force ‘blob’ data into a table

* completely schemaless, so you don’t have to worry about ‘blob’ data being the wrong type, just shove it in the repo and commit it

* roll back to any point in the commit history, not just as far back as the last vacuum/checkpoint
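rough sketch of what the commit and rollback steps above look like, written as python that just assembles the commands. the helper names (commit_commands, rollback_commands, run_all) and the data/videos path are made up for illustration; the dvc and git subcommands themselves are the standard ones. it assumes a repo where git init, dvc init and a dvc remote are already set up.

```python
from pathlib import PurePosixPath
import subprocess


def commit_commands(target, message):
    """Commands to record a changed file or directory with DVC and Git."""
    target = PurePosixPath(target)
    pointer = f"{target}.dvc"  # small text file DVC writes next to the data
    gitignore = str(target.parent / ".gitignore")  # DVC lists the data path here
    return [
        ["dvc", "add", str(target)],         # hash the data, update the pointer
        ["git", "add", pointer, gitignore],  # git tracks only the pointer
        ["git", "commit", "-m", message],
        ["dvc", "push"],                     # upload the actual blobs to the remote
    ]


def rollback_commands(rev):
    """Commands to put the workspace back to an earlier Git revision."""
    return [
        ["git", "checkout", rev],  # restores the old .dvc pointer files
        ["dvc", "checkout"],       # syncs the data files to those pointers
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

calling run_all(commit_commands("data/videos", "re-encode clips")) would execute the four steps in order. if the old version of the data isn't in the local cache any more, swap the dvc checkout for a dvc pull so it gets fetched from the remote first.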

basically, shoving ‘blob’ data into tabular data formats is a recipe for pain.

shoving ‘blobs’ into a git-like repo is much faster and easier.

especially if you need full version history, branches for different outcomes etc.

trying to have different branches in Iceberg for your set of really long 1080p video files, where you have applied different ffmpeg filters in different branches and want people to be able to access all of them and their history, sounds nightmare-ish.

in dvc, that’s ^ easy.
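to make the branch-per-filter idea a bit more concrete, here's a minimal sketch in the same style. the function names, branch name and file paths are hypothetical; hue=s=0 is a real ffmpeg filter (desaturate), and the git/dvc subcommands are the standard ones. it assumes the repo already has a dvc remote configured.

```python
from pathlib import PurePosixPath


def variant_commands(branch, video_filter, src, dst):
    """Commands to create a branch holding one ffmpeg-filtered variant."""
    dst_path = PurePosixPath(dst)
    return [
        ["git", "checkout", "-b", branch],                # new branch for this variant
        ["ffmpeg", "-i", src, "-vf", video_filter, dst],  # apply the video filter
        ["dvc", "add", dst],                              # track the output with DVC
        ["git", "add", f"{dst}.dvc", str(dst_path.parent / ".gitignore")],
        ["git", "commit", "-m", f"apply {video_filter}"],
        ["dvc", "push"],                                  # publish the blobs
    ]


def fetch_commands(branch):
    """Commands someone else runs to get the files for one variant."""
    return [
        ["git", "checkout", branch],  # pick the variant's pointer files
        ["dvc", "pull"],              # download the matching video files
    ]
```

each variant is just another branch with its own pointer files, so anyone who wants the grayscale version checks out that branch and pulls, and git log on the .dvc file gives the history.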

basically, it’s like creating a data lake which won’t turn into a data swamp because everything is version controlled.