752 points by crazypython | 17 comments
1. strogonoff ◴[] No.26370733[source]
You can also use Git for data!

It’s a bit slower, but smart use of partial/shallow clones can address performance degradation as repositories grow over time. You just need to take care of the transformation between “physical” trees/blobs and “logical” objects in your dataset (which may not have a 1:1 mapping, as keeping the physical layer more granular reduces the likelihood of merge conflicts).
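As a rough illustration of what I mean by partial/shallow clones (standard Git flags; the repository URL is a placeholder):

    # shallow clone: only the most recent commit's worth of history
    git clone --depth 1 https://example.com/data-repo.git

    # partial clone: full history, but blob contents are fetched lazily on checkout
    # (needs a server that supports partial clone)
    git clone --filter=blob:none https://example.com/data-repo.git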

I’m also following Pijul, which seems very promising with regard to versioning data. I believe they might introduce primitives that allow operating on changes in actual data structures, rather than between lines in files as with Git.

Add to that a sound theory of patches, and that’s a definite win over Git (or Dolt for that matter, which seems to be the same old Git, but for SQL).

replies(5): >>26371219 #>>26371307 #>>26371593 #>>26372041 #>>26373741 #
2. teej ◴[] No.26371219[source]
The fact that I can use git for data if I carefully avoid all the footguns is exactly why I don’t use git for data.
3. pradn ◴[] No.26371307[source]
Git is too complicated. It's barely usable for daily tasks. Look at how many people have to Google for basic things like uncommitting a commit, or cleaning your local repo to mirror a remote one. Complexity is a liability. Mercurial has a nicer interface. And now I see the real simplicity of non-distributed source control systems. I have never actually needed to work in a distributed manner, just client-server. I have never sent a patch to another dev to patch into their local repo or whatnot. All this complexity seems like a solution chasing after a problem - at least for most developers. What works for Linux isn't necessary for most teams.
replies(3): >>26371364 #>>26371641 #>>26374196 #
4. ttz ◴[] No.26371364[source]
Git is used prolifically in the tech industry. What on earth are you talking about?
replies(2): >>26371383 #>>26371527 #
5. detaro ◴[] No.26371383{3}[source]
Being needlessly complicated seldom stops the tech industry from using something, as long as the complexity is slightly out of the way.
replies(1): >>26372665 #
6. strogonoff ◴[] No.26371527{3}[source]
To me there’s some irony in that all the instant criticism of Git in the responses to my comment presumably applies just as well to a project that describes itself as “Git for data” and promises exact reproduction of all Git command behaviour, and therefore suffers from the same shortcomings.
7. vvanders ◴[] No.26371593[source]
Nope, been there done that, no thanks.

Lack of locking for binary files, the overhead once repos grow past ~1 GB, and all the shenanigans you need to do for proxy servers. There are better solutions out there, but they aren’t free.

replies(1): >>26371630 #
8. strogonoff ◴[] No.26371630[source]
I’d be very curious to hear more about the issues with proxy servers (where were they required?), the overhead (do you mean RAM usage?), and locking.
replies(1): >>26372224 #
9. strogonoff ◴[] No.26371641[source]
Dolt boasts its likeness to Git as a feature. Does this mean it’ll also be barely usable for daily tasks? Is it possible for a project to faithfully reproduce the entirety of Git’s command interface and yet be less complicated than Git / not suffer from the same shortcomings?

I personally think Git isn’t that bad, once it’s understood. It can be counter-intuitive sometimes, though (as an example, for the longest time I used Git without realizing that it stores a snapshot of each file, and diffs/deltas are only computed when required). I’m just trying to be pragmatic and not expecting a tool like Git to be entirely free of leaky abstractions.
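A quick way to see the snapshot model in action, using plain Git plumbing (the file name is just an example):

    # every commit records a full blob for each file, not a diff
    echo "v1" > notes.txt && git add notes.txt && git commit -m "v1"
    echo "v2" > notes.txt && git add notes.txt && git commit -m "v2"
    git cat-file -p HEAD~1:notes.txt   # full contents of the old version
    git cat-file -p HEAD:notes.txt     # full contents of the new version

    # deltas are only created at pack time (git gc / git repack)
    git gc
    git verify-pack -v .git/objects/pack/pack-*.idx | head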

10. rapjr9 ◴[] No.26372041[source]
We used git to store and manage data sets for a machine learning project involving chewing detection, with audio data used in training. It was cumbersome, and the huge datasets caused some problems with git (e.g., searches of our code base got really slow because the data was being searched as well, until we moved the data to a separate repo). Something easier to use that could manage large datasets would be useful.
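(In hindsight, one thing that might have helped, assuming the data lived under something like a data/ directory, is an exclude pathspec so searches skip it:)

    # search the code but skip the dataset directory (path is illustrative)
    git grep "chewing" -- ':!data/'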

I wonder if Dolt could be used to create a clone of Apple's Time Machine. Seems like the basics are there.

11. vvanders ◴[] No.26372224{3}[source]
Sure. Keep in mind that my data is a little old, but the last time I peeked into the Git LFS space it seemed like there were still a few gaps.

First, most of my background in this area comes from gamedev, so YMMV as to whether the same applies to your use cases.

For our usage we'd usually have a repo history size that crossed the 1TB mark and even upwards of 2-3TB in some cases. The developer sync was 150-200GB, the art sync was closer to 500-600GB and the teams were regularly churning through 50-100GB/week depending on where we were in production.

You need discipline-specific views into the repo. It just speeds everything up and means that only the teams that need to take the pain have to. From a performance perspective, Perforce blows the pants off anything else I've seen; SVN tries, but P4 was easily an order of magnitude faster to sync or do a clean fetch.
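(For what it's worth, the closest Git equivalent to those per-discipline views that I know of is sparse checkout; a rough sketch, with directory names made up for the example, and it doesn't solve the sync-speed problem:)

    # fetch everything, but only materialize the directories this discipline needs
    git sparse-checkout init --cone
    git sparse-checkout set engine tools      # e.g. the programmer view
    # artists would instead set art animation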

I've seen proxy servers done with git, but it's usually some really hacky thing scripted together with a ton of duct tape and client-specific host overrides. When you have a team split across East Coast/West Coast (or another country) you need that proxy so that history is cached in a way that it only gets pulled in locally once. Having a split push/pull model is asking for trouble, and last I checked it wasn't clear to me whether stuff like Git LFS actually handles locking cleanly across it.

From an overhead perspective, git just falls over at ~1 GB (hence Git LFS, which I've seen teams use with varying degrees of success depending on project size). The need to do shallow history and sidestep resolving deltas is a ton of complexity that isn't adding anything.

With a lot of assets, merging just doesn't exist, and a DVCS totally falls over here. I've seen fights nearly break out in the hallway multiple times when two artists/animators both forgot to check out a file (usually because someone missed the metadata saying it's an exclusive-access file). With unmergeable binary files that don't get locked, your choice is who gets to drop 1-3 days of work on the floor when the other person blows away their changes to commit. If those changes span multiple interconnected packages/formats/etc., you have a hard fork that you can never bring back together.
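(Git LFS does at least expose a locking primitive these days; a minimal sketch, with the file path made up for the example, though whether it holds up across proxies and large teams is exactly the open question:)

    # mark the format as LFS-tracked and lockable
    git lfs track --lockable "*.psd"

    # take an exclusive lock before editing, release it after pushing
    git lfs lock art/hero.psd
    git lfs unlock art/hero.psd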

There are a couple of other details, but those are the large ones. Perforce worked incredibly well in this space, but it is not cheap, and so I've seen teams try to go their own way with mixed success. I'll admit that you can't do a monorepo in P4 (and even tools like repo in the Android world have their problems too), but if you segregate your large business/product lines across P4 repos it scales surprisingly well.

Anyway, you may or may not hit any or all of this, but I've yet to see git tackle a 1 TB+ repo history well (and things like repo, which uses many mini-repos, don't count in my book due to the lack of atomicity across submissions that span multiple repos).

replies(1): >>26372471 #
12. strogonoff ◴[] No.26372471{4}[source]
This is super informative!

In my case it’s different, since Git isn’t accessed by users directly; rather, I’m working on some tools that run on top of Git (on the user’s machine). The data is primarily text-based, though sometimes binary assets come up (options for offloading them out of Git are being investigated).

So far there have been no major issues. I expect degradation over time as repos grow in size and history (Git is not unique in this regard, but it’ll probably be more rapid and easier to observe with Git), so we might start using partial clones.

(I stand by the idea that using straight up Git for data is something to consider, but with an amendment that it’s predominantly text data, not binary assets.)

replies(1): >>26373110 #
13. ttz ◴[] No.26372665{4}[source]
"Barely usable for daily tasks"

Is a pretty strong statement, especially given many tech companies use it exactly for this purpose.

Git might have a learning curve, and sure, it's not the simplest. But "barely usable" is hyperbole in the face of actual evidence.

I'm not defending Git specifically; other VCSs are just as viable. The quoted statement seems a bit ridiculous.

14. vvanders ◴[] No.26373110{5}[source]
Yeah, my experience has been that you start seeing issues with long delta decompression times around the 1-2 GB mark. That climbs quicker if you have binary formats that push the delta compression algorithm into cases where it does poorly (which makes sense, since it was optimized for source code).

If you have binary assets that don't support merging or regeneration from source artifacts, that mandates locking (ideally built into the SCM, but I've seen wiki pages used in a pinch at small scale).

15. graderjs ◴[] No.26373741[source]
I strike a balance with this by using git on JSON files, and I build the JSON files into a database (1 file per record, 1 directory per table, subdirectories for indexes). The whole thing is pretty beautiful, and it's functioning well for a user-account and access-management database I'm running in production. I like that I can go back and do:

`git diff -p` to see the users who have signed up recently, for example.

You can get the code over at: https://github.com/i5ik/sirdb
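Roughly, the layout looks like this (names are illustrative and simplified; see the repo for the real format):

    users/                          # one directory per table
      aF3k9.json                    # one file per record, keyed by ID
      indexes/
        email/
          alice@example.com         # index entry resolving to a record ID

    # recently added users, straight from Git history
    git log --diff-filter=A --name-only -- users/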

The advantages of this approach are using existing Unix tooling for text files, solid versioning, easy inspectability, and leveraging the filesystem's B-tree indexing as a fast index structure (rather than having to write my own B-trees). Another advantage is hardware-linked scaling: with regular hard disks it's slower, with SSDs it's faster, and it should also be possible to mount the DB as a RAM disk and make it super fast.

The disadvantages are that the database side still only supports a couple of operations (exact and multi-key searches, lookup by ID, and so on) rather than a rich query language. I'm OK with that for now, and I'm also thinking of using skip lists in the future to get a nice ordering property for the keys in an index, so I can easily iterate and page over them.

16. Hendrikto ◴[] No.26374196[source]
> Git is too complicated. It's barely usable for daily tasks. Look at how many people have to Google for basic things like uncommitting a commit, or cleaning your local repo to mirror a remote one.

Cars are too complicated. They are barely usable for daily tasks. Look at how many people have to Google for basic things like changing a fan belt, or fixing a cylinder head gasket.

You can fill in almost anything here. Most tools are complicated. Yet you don’t need to know their ins and outs for them to be useful to you.

replies(1): >>26377022 #
17. yoavm ◴[] No.26377022{3}[source]
To me it sounds like you're proving the exact opposite. I'd assume most car owners never need to change a fan belt themselves, while everyone who uses Git daily has needed to revert a commit at some point. "How to turn right" isn't huge on Stack Overflow, last time I checked...