Dolt is Git for data

(www.dolthub.com)
358 points timsehn | 64 comments
1. peteforde ◴[] No.22734564[source]
Only 39 days since the last "GitHub for data" was announced: https://news.ycombinator.com/item?id=22375774

I'll say what I said in February: I started a company with the same premise 9 years ago, during the prime "big data" hype cycle. We burned through a lot of investor money only to realize that there was not a market opportunity to capture. That is, many people thought it was cool - we even did co-sponsored data contests with The Economist - but at the end of the day, we couldn't find anyone with an urgent problem that they were willing to pay to solve.

I wish these folks luck! Perhaps things have changed; we were part of a flock of 5 or 10 similar projects and I'm pretty sure the only one still around today is Kaggle.

https://www.youtube.com/watch?v=EWMjQhhxhQ4

replies(15): >>22734677 #>>22734738 #>>22734742 #>>22734839 #>>22735019 #>>22735030 #>>22735213 #>>22735358 #>>22735661 #>>22736049 #>>22736513 #>>22736785 #>>22737514 #>>22737860 #>>22738642 #
2. sjtindell ◴[] No.22734677[source]
Just want to say, really appreciate this food for thought. Where do you go and see someone say "my company tried it and...". This site is a godsend.
replies(1): >>22735831 #
3. timsehn ◴[] No.22734738[source]
I'll check it out. We think the world is a little more ready for this now, given how widely Git is adopted and the advances in other data tooling (like ML). But, as we're all aware, starting a business is hard :-)
4. zyang ◴[] No.22734742[source]
The need is definitely there. My day job involves such a need. But we simply cannot trust a drive-by startup to fill the gap. It's safer just to roll your own.
replies(1): >>22734918 #
5. philipov ◴[] No.22734839[source]
Git succeeded because it was free, and then business models were able to be built up around the open-source ecosystem after a market evolved naturally. There is a need, but if you go into it trying to build a business from scratch, you're going to have a bad time.
replies(2): >>22735082 #>>22735219 #
6. timsehn ◴[] No.22734918[source]
Dolt is and always will be open source. Feel free to fork a copy and make it your own.
7. drewmol ◴[] No.22735019[source]
Do you consider your effort a mistake looking back? Were you able to pay yourself fairly or was it a losing investment?
replies(1): >>22735330 #
8. ken ◴[] No.22735030[source]
That's GitHub for data. It's a service, and they still haven't launched anything yet.

This is Git for data. It's a program, and it appears to be an open-source one you can download and use today.

replies(2): >>22735037 #>>22735068 #
9. pavlov ◴[] No.22735037[source]
The OP points to a site called DoltHub.com, so it’s not like they don’t have ambitions to commercialize as another “GitHub for data”.
replies(1): >>22737491 #
10. enos_feedler ◴[] No.22735068[source]
There is actually an old git for data project too:

https://github.com/datproject/dat

It's ~5 years old and I really wanted it to be huge. Hoping this new project is a success. Especially since I notice I went to high school with one of the founders of Dolt (Hey Tim!)

replies(3): >>22735185 #>>22736907 #>>22737454 #
11. TylerE ◴[] No.22735082[source]
Git succeeded because of Linus.

Sure as hell wasn't because of the UX, else Mercurial would have won, or even DARCS.

99.99999% of projects are not the Linux kernel

replies(7): >>22735432 #>>22735442 #>>22735596 #>>22735880 #>>22736021 #>>22736399 #>>22737152 #
12. visarga ◴[] No.22735185{3}[source]
Can it remove a file from the repo history? It's a GDPR feature that makes git hard to use for data.
13. staflow ◴[] No.22735213[source]
1. Develop a solution with catchy name.

2. ONLY and only after it’s finished, start looking for a problem that fits that solution.

3. Realize there is none.

4. ?????

5. Profit!

Gee I wonder why our economy is so ineffective..

replies(1): >>22735273 #
14. waynesonfire ◴[] No.22735219[source]
Being free is a huge plus; but equally or even more importantly, it's a better product.
replies(1): >>22735430 #
15. chmaynard ◴[] No.22735273[source]
Step 4 is where you create a snazzy website with lots of seductive images, clever animations, and "Free Trial" buttons.
16. peteforde ◴[] No.22735330[source]
The answer to that can only be nuanced.

On one hand, startups are as exciting as bank heists. You put together an amazing team, do your homework as best you can, then get killed trying to execute your perfect plan. I'm proud of what we built and we all learned a lot the hard way.

On the down-side, building a company is emotionally, physically, mentally exhausting. It wasn't really a matter of whether I could pay myself fairly; I drew a typical programmer salary along with everyone else.

However, the important detail is that you're spending someone else's money and every penny of it represents someone you respect putting their trust in you, and you feel the weight of that every day.

Ultimately, I don't exactly regret it but I certainly wish that we weren't so convincing that we convinced ourselves of a market opportunity that we couldn't access or didn't exist at all. There was so much heat for "data" in 2011 it really seemed like we just had to show up with an amazing product.

We were wrong.

replies(1): >>22735638 #
17. kidintech ◴[] No.22735358[source]
Palantir is successful as well, but no clue what the hateboner on HN for them is, since it never even gets mentioned in these threads.
replies(1): >>22735367 #
18. th3iedkid ◴[] No.22735367[source]
Is it because much of Palantir's success is with large enterprises and not startups?
replies(1): >>22735378 #
19. kidintech ◴[] No.22735378{3}[source]
OP says "we couldn't find anyone with an urgent problem that they were willing to pay to solve" and "there was no market to capture". Wouldn't a counter example contradict those statements, regardless of the size of Palantir's clients?
20. dagw ◴[] No.22735430{3}[source]
Was git really better than, for example, Mercurial or Darcs for most 'normal' projects when it was released? I certainly don't remember it as such. It was certainly better for Linus and the specific workflow problems he had with kernel development, but I don't recall it being better overall at the time.

What its release (and the very public BitKeeper spat leading up to its release) did do was bring the idea of distributed VCS to the forefront.

replies(1): >>22735963 #
21. sdan ◴[] No.22735432{3}[source]
I'd argue it succeeded (at least more recently) because of the UX that companies like GitHub and GitLab gave it, not particularly because of Linus or because it was free.
replies(2): >>22735471 #>>22735613 #
22. greggman3 ◴[] No.22735442{3}[source]
Mercurial would not have won. Mercurial has since added features, which are not the recommended workflow according to their docs, to have a similar branching model to git, but the default "as designed" workflow of hg is arguably inferior to git (yes, I know that word will get downvoted).

Without git, git's style of branching would likely never have been added to hg, and even though it's been added now, AFAICT hg people don't use it. No idea why. Git people get how much freedom git branches give them, freedom that other VCSes, including hg, don't/didn't.

replies(2): >>22735510 #>>22735515 #
23. TylerE ◴[] No.22735471{4}[source]
If Linus hadn’t pushed it, it never would have caught on.
replies(1): >>22735654 #
24. koonsolo ◴[] No.22735510{4}[source]
Git branching is not intuitive, because they are not branches but pointers/labels. When you talk about the master branch, you actually talk about the master pointer.

The other VCSes have an intuitive concept of branches, because they are in fact branches.

I liked Mercurial more than Git, but when Bitbucket dropped Mercurial I also switched to Git.

replies(4): >>22735889 #>>22735990 #>>22736020 #>>22736421 #
25. radarsat1 ◴[] No.22735515{4}[source]
Nice to read this. I was trying to collaborate on a single project that used Mercurial, and man as a git user I could not understand the branching model.. had the hardest time. I ended up working from a local git repo, doing my work there, and then very carefully pushing the commits one at a time at the very end. If I made a mistake, I basically re-cloned the Hg repo because apparently editing history is a no-no. I found the experience very frustrating.

Sibling comment to mine:

> Git branching is not intuitive, because they are not branches but pointers/labels.

Funny, that's exactly why I DO find git branches more intuitive.

26. akvadrako ◴[] No.22735596{3}[source]
Before git came along monotone was looking like the best DVCS.

But it had no chance to compete with Linus’ marketing power.

27. babuskov ◴[] No.22735613{4}[source]
Github was made because people wanted to use Git really bad, but the UX wasn't there. Git was already successful at that point.

Git succeeded because it was good. Github just made it more accessible.

28. dan00 ◴[] No.22735638{3}[source]
> Ultimately, I don't exactly regret it but I certainly wish that we weren't so convincing that we convinced ourselves of a market opportunity that we couldn't access or didn't exist at all.

How should you really know this without trying? If you aren't convinced, where should the motivation come from?

replies(1): >>22736438 #
29. bryanrasmussen ◴[] No.22735654{5}[source]
Sure lots of good things would never catch on except an influential person in the field the good thing targets says: hey this is really useful for us!

This gets lots of people to look at it. But those people still have to decide in the end of the day whether it is actually useful for them.

replies(1): >>22735681 #
30. roystonvassey ◴[] No.22735661[source]
In our group we use git for code repos and cloud for storage and actual compute. It works seamlessly and git APIs work fantastically with almost any service, IDE or whatever your tool of choice.

I suspect that with increasing cloud adoption, accessing data is getting easier by the day, and I see no real need for a “git for data” tool. Plus, as a data scientist, it allows me to keep code and data separate, especially if I’m working with confidential data.

31. sdan ◴[] No.22735681{6}[source]
Reminds me of sports:

Just because Steph Curry uses Under Armour doesn't mean everyone will too.

I think obviously something created by Linus was deemed to be of great value, but what made Git as profound as it is today is: UX for the majority of people who are beginners, which was Github backed by millions of other developers.

32. austinjp ◴[] No.22735831[source]
Agreed, it's always refreshing to read candid posts like this. A useful term for Google is "startup post mortem" although you still have to do plenty of sifting.
33. stevekemp ◴[] No.22735880{3}[source]
DARCS would not have seen significant further growth, due to the merge-of-doom problem.
replies(1): >>22737441 #
34. ynx ◴[] No.22735889{5}[source]
I must be an outlier, because it's always been the opposite for me.

I started on Mercurial and didn't use Git for years. The moment I switched to Git everything made so much more sense to me. Mercurial seemed like it did magic and wouldn't explain it to you. There were multiple kinds of branches, there were revision numbers, octopus merges were impossible to understand, the whole thing tried to act immutable but effective workflows included history editing for squashing and merging and amending and cherry-picking, which is anything but. Partial commits were a little bit of a mystery to me, and shelves seemed to be their own separate thing.

To me Git was simple in comparison. The working copy was the last state at the end of a long sequence of states. Patches were just the way you represented going from one state to another, rather than canonical, so you wouldn't resolve an octopus merge so much as you would get to your desired state and call it a day. Branches were labels to a particular state. Stashes were labels with an optimized workflow. Reflog was just a list of temporary-ish labels. New commits were built against the index, which you could add to or remove from independently of file state. Branches were branches were branches, no matter where the repository was. Disconnecting from upstream was simply a matter of removing a remote.

I know it doesn't match up with other people, but I simply have never been able to see Mercurial as an example of a good tool /despite starting on it/. It's always been easier to use git at any level of complexity I need it depending on the problem I'm solving, whether it's saving code or rescuing a totally botched interactive rebase, merge, etc.

35. jfkebwjsbx ◴[] No.22735963{4}[source]
Yes, it was way better. That is why people started using it outside the kernel. People forget how old Git is and think GitHub started it.

Hg back in the day was quite limited and followed the paradigm of other VCSes (which was enforced on you, by the way).

36. barrkel ◴[] No.22735990{5}[source]
Git branches as labels into a DAG of edits maps exactly to what I think branches are. The difference between two branches is their respective edits from a common base. If you muck up a commit, you reset the pointer to the previous commit. If you muck that up, and accidentally reset too much, you can use your reflog to find out where you used to be on the DAG and reset the branch to that.

The transparency of the mechanism enables the user to be more powerful while knowing fewer concepts in total. The power of the system comes from the composition of simple parts.
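
The "branches are labels into a DAG" model described above can be sketched in a few lines of Python. This is a toy illustration only, not git's actual object format or implementation; the `Commit` and `Repo` classes and their method names (`commit`, `create_branch`, `reset`, `unique_to`) are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Commit:
    """An immutable node in the history DAG (toy model, not git's format)."""
    id: str
    parents: Tuple[str, ...] = ()


@dataclass
class Repo:
    """Hypothetical toy repository: a commit store plus movable labels."""
    commits: Dict[str, Commit] = field(default_factory=dict)
    branches: Dict[str, str] = field(default_factory=dict)  # label -> commit id
    # Each reflog entry records (branch, previous commit id, new commit id).
    reflog: List[Tuple[str, Optional[str], str]] = field(default_factory=list)

    def _move(self, branch: str, new_id: str) -> None:
        # Every label move is logged, so an earlier position can be recovered.
        self.reflog.append((branch, self.branches.get(branch), new_id))
        self.branches[branch] = new_id

    def commit(self, branch: str, commit_id: str) -> None:
        # A new commit points at the branch's current tip, then the label advances.
        parent = self.branches.get(branch)
        self.commits[commit_id] = Commit(commit_id, (parent,) if parent else ())
        self._move(branch, commit_id)

    def create_branch(self, name: str, at_branch: str) -> None:
        # Creating a branch copies a pointer; no commits are duplicated.
        self._move(name, self.branches[at_branch])

    def reset(self, branch: str, commit_id: str) -> None:
        # Only the label moves; the "undone" commit stays in the store.
        self._move(branch, commit_id)

    def ancestors(self, commit_id: str) -> Set[str]:
        seen: Set[str] = set()
        stack = [commit_id]
        while stack:
            cid = stack.pop()
            if cid not in seen:
                seen.add(cid)
                stack.extend(self.commits[cid].parents)
        return seen

    def unique_to(self, branch: str, other: str) -> Set[str]:
        # Commits reachable from `branch` but not from `other`,
        # i.e. its edits since the common base.
        return (self.ancestors(self.branches[branch])
                - self.ancestors(self.branches[other]))


repo = Repo()
repo.commit("master", "a")
repo.commit("master", "b")
repo.create_branch("feature", "master")
repo.commit("feature", "c")
repo.commit("master", "d")
repo.reset("master", "b")                 # reset too far: "d" looks lost
repo.reset("master", repo.reflog[-1][1])  # the reflog still knows where master was
```

In this sketch a reset never deletes anything; it only moves a label, which is why the log of earlier label positions is enough to recover from a mistaken reset.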

37. Camillo ◴[] No.22736020{5}[source]
The intuitive concept of a branch is a limb on a tree.
38. tinco ◴[] No.22736021{3}[source]
99.99999% of projects are not the Linux kernel, so how could Git have succeeded because of Linus, other than Linus originating the genius design of it? The Ruby community jumped onto Git even though there was no Github, and Ruby itself didn't use Git. In my opinion it was because Git was the first tool that was superior to SVN in every way.

The first time I used Git I swore I would never use SVN again. It was even popular back then to set up git+svn systems so you could work on your git repo, and push a branch to svn to satisfy your employer.

People associate git with Github (and Gitlab), but it used to be very common to just set up a ssh server that people could push projects on to, my server still has a dozen or so projects on it that I haven't touched in a decade. Github spawned from the popularity of Git in the Ruby community, and the desire to make it a little more accessible to people that didn't want to have their own git servers.

replies(1): >>22736394 #
39. deforciant ◴[] No.22736049[source]
We also started "Git for data" several years ago but have since pivoted to data science/ML tooling (https://dotscience.com/) by building features that people actually want on top of the original product. Since then, the "git for data" part probably accounts for only 5% of the total functionality :)

I guess "Git for data" is not very useful if you don't have the whole platform built around it to actually use the features. We mainly use it for data synchronization between the nodes and provenance tracking so people can see what data was used to build specific models and to track how the project evolves itself without forcing people to "commit" their changes manually (as we have seen that often data scientists don't even use git, just files on their Jupyter notebooks).

40. wainstead ◴[] No.22736394{4}[source]
> so how could Git have succeeded because of Linus, other than Linus originating the genius design of it?

Perhaps that is exactly the point. There was a fair amount of hype and press coverage over Git when it was first unveiled. And it was because Linus wrote it, and wrote it in an unexpectedly short time. And it was on the coattails of the whole Bitkeeper saga.

replies(1): >>22744074 #
41. ComodoHacker ◴[] No.22736399{3}[source]
GP was probably meaning GitHub, not Git.
42. Jestar342 ◴[] No.22736421{5}[source]
AFAIK (from the rumour mill and not from any kind of reliable source) the `git branch` command was only added as a cargocult from all the SVN users flocking to git and asking "So how do I branch?!". Previous to this, everything was tags and checkouts.

Again, no verifiable source, just water cooler talk with other devs.

replies(1): >>22737250 #
43. peteforde ◴[] No.22736438{4}[source]
Actually, I do have a pretty good answer to this.

Every business has about 100 questions where if you know the answer to those questions, you are quite likely to be successful. The hard part is knowing which questions need to be asked of each business.

To be clear: my co-founders and I were all seasoned, multi-startup people. We had extremely high-calibre investors with finely tuned bullshit meters. In the end, our own pitching skill undid us because I think less-convincing founders would have undergone 25% more scrutiny and that would have required us to demonstrate that we had signed LOIs from 3-5 real customers before starting.

There was a subconscious misdirection around the fact that we didn't actually have anyone beating down our door. We let the excitement for "data", our personalities and our track records carry the moment.

Of course founders have to drink their own Kool-Aid to some extent or they won't make it. But there's real power and value in the customer development mindset. People want this? Prove it.

44. NicoJuicy ◴[] No.22736513[source]
> that there was not a market opportunity to capture

Weird, there is an internal project (already live) at the multinational company I work for where it's used.

45. solatic ◴[] No.22736785[source]
I wonder how much of the market opportunity here is contingent upon market education. Everybody can clearly see the value of having a personal automobile, but how successful can you be at selling automobiles to people who don't know how to drive? Do people desire cars enough to buy one if they don't know how to drive?

Everyone can see how FAANG companies are growing wealthy off the mountains of data they are amassing, so everyone understands how data can be desirable. But what if your potential market base doesn't understand how to "drive" data - how to identify which data would be valuable for them and how best to exploit it? It seems to me that part of a go-to-market strategy needs, at least in the short term, to help potential customers transition from "that's a really shiny bauble" to "I understand how this is going to make me money."

replies(1): >>22737516 #
46. cbenz ◴[] No.22736907{3}[source]
Dat is more about distribution (decentralized, distributed, P2P) but it's not possible to make queries.
replies(1): >>22738838 #
47. tannerbrockwell ◴[] No.22737152{3}[source]
Git exists, because Bitkeeper were being Aholes. [1] A developer needed some metrics on the Bitkeeper repository that Linux used. Remember this is a proprietary and commercial product that granted a handful of licenses to the Linux community as a token of support. So when Andrew Tridgell reverse engineered the format that Bitkeeper used, they threatened to sue him under the DMCA.

This caused a firestorm, some defended him, others defended Bitkeeper, and a lot of people said why the hell is Linus using proprietary software to manage an Open Source project?!?!! Linus waded in and said he'd think about it, I think it was on a Thursday or Friday, and by the next week he had a working prototype of git. [2] The rest is history. Bitkeeper faded into irrelevance and git became the lingua franca for open source projects. Arguably its biggest strength was not revision control, but being designed in a manner that many collaborators could seamlessly commit changes for merging. Obviously architected to fulfill the time consuming requirements of Linus Torvalds, it has stood the test of time. I'm writing this from memory, so if it disagrees with Wikipedia take it with a grain of salt.

[1]: https://en.wikipedia.org/wiki/BitKeeper#Original_license_con... [2]: https://en.wikipedia.org/wiki/Git#History

replies(1): >>22738686 #
48. tosser678 ◴[] No.22737250{6}[source]
from the first kernel merge (link found in wikipedia)

https://marc.info/?l=git&m=111377572329534

I don't know about 'git branch', but it looks like 'git merge' wasn't a thing

edit: from searching a bit, it appears that it had branches in June of the launch year, dunno if it had those on release.

replies(1): >>22738581 #
49. Tomte ◴[] No.22737441{4}[source]
It's been years since I have read about darcs. Did they fix that infinite time merge problem at some point?
replies(1): >>22739259 #
50. ken ◴[] No.22737454{3}[source]
That project looks like a command-line p2p file sharing system. There doesn't appear to be any branching. It also doesn't appear to be a database (like with a schema), but simply raw files being passed around. There's no data types or queries.

I'm not sure why you bring it up now. They don't call it "git for data" anywhere that I see, and it's missing 2 of the 3 core features that I think a "git for data" would need to have.

replies(1): >>22738714 #
51. ken ◴[] No.22737491{3}[source]
Sure, but the existence of GitHub has only helped Git grow, not hurt it.
52. dmitryminkovsky ◴[] No.22737514[source]
You may be right in the big picture, but not all "GitHubs for data" are the same. This product seems super cool:

> Dolt is Git for data. Instead of versioning files, Dolt versions tables. DoltHub is a place on the internet to share Dolt repositories. As far as we can tell, Dolt is the only database with branches.

I find it hard to believe no other database has branches, but if that's true and if this product works like you'd imagine, that is really cool.

Given your historical observation, I think you're right that this will not lead to a market revolution, but sometimes you need the right product to change the landscape.
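
To make "versioning tables instead of files" concrete, here is a minimal Python sketch of a row-level diff between two versions of a table keyed by primary key. It is an assumption-laden toy, not Dolt's storage format, algorithm, or API; the `diff_tables` function and the sample rows are hypothetical.

```python
from typing import Any, Dict, Tuple

Row = Dict[str, Any]
Table = Dict[Any, Row]  # primary key -> row (toy representation)


def diff_tables(
    old: Table, new: Table
) -> Tuple[Table, Table, Dict[Any, Tuple[Row, Row]]]:
    """Return (added, removed, modified) rows between two table versions."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    modified = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, modified


# Two hypothetical versions of the same table, e.g. on two branches.
v1 = {1: {"name": "alice", "score": 10}, 2: {"name": "bob", "score": 7}}
v2 = {1: {"name": "alice", "score": 12}, 3: {"name": "carol", "score": 9}}

added, removed, modified = diff_tables(v1, v2)
```

A file-oriented VCS would report these changes as edited lines in a dump; a table-oriented diff reports them as rows added, removed, or changed, which is the distinction the quoted description draws.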

53. peteforde ◴[] No.22737516[source]
Maybe, but also maybe there just isn't a huge demographic of data scientists with discretionary purchasing capacity and a hair-on-fire problem that they are desperately searching for someone to take their money and fix.

I think that a lot of the data VIP types we met with honestly wanted to know why they needed it, but the more they thought about it, the more it just seemed like a shiny thing.

It's telling that dozens of similar companies with smart people behind them have thrown their talents at this solution, and none of them have located the problem people are eager to pay to solve.

54. nailer ◴[] No.22737860[source]
> we were part of a flock of 5 or 10 similar projects and I'm pretty sure the only one still around today is Kaggle.

Dat is still around, on version 14, last update in git was 5 days ago.

https://github.com/datproject/dat https://dat.foundation/

replies(1): >>22737922 #
55. moondowner ◴[] No.22737922[source]
Impressed by the documentation https://datprotocol.github.io/how-dat-works/
56. enigmo ◴[] No.22738581{7}[source]
The git log is also handy.

first "merge": https://git.kernel.org/pub/scm/git/git.git/commit/?id=33deb6...

first "tag": https://git.kernel.org/pub/scm/git/git.git/commit/?id=bf0c6e...

first "branch": https://git.kernel.org/pub/scm/git/git.git/commit/?id=74b242...

first Linus "branch" commit: https://git.kernel.org/pub/scm/git/git.git/commit/?id=e69a19...

57. specialist ◴[] No.22738642[source]
Maybe just too early?

Grant funders are starting to require teams to publish their code and data. Maybe they're the target audience? Data repo vendors could get on the list of approved vendors for teams receiving funding.

58. specialist ◴[] No.22738686{4}[source]
Events like this, even in the small, keep me from outright dismissing "hero based" (whatever it's called) theories of explaining history.

Coincidences, accidents, grudges, misunderstandings coupled with path dependencies.

replies(1): >>22745881 #
59. enos_feedler ◴[] No.22738714{4}[source]
Like I said, this project is old. I brought it up in the context of older projects, independent of whether they succeeded, pivoted, etc. If you did some research you would have made this connection:

https://www.youtube.com/watch?v=FX7qSwz3SCk (2013) - 'Introducing Dat: If Git Were Designed For Big Data' Talk by the founder.

My point is they pivoted and so maybe this idea won't work, or this was too early.

EDIT: Looking back on _your_ post, I mentioned it because you specifically said "It's a program, and it appears to be an open-source one you can download and use today." And that is what 'dat' is/was. I thought I would mention it.

60. enos_feedler ◴[] No.22738838{4}[source]
Yes, the project used to associate itself as a git for data, but I guess not in the sense of a db.
61. stevekemp ◴[] No.22739259{5}[source]
Not entirely fixed by the look of things:

Darcs 2 (introduced in 2008-04) reduces the number of scenarios that will trigger an exponential merge. Repositories created with Darcs 2 should have fewer exponential merges in practice.

http://darcs.net/FAQ/Performance#is-the-exponential-merge-pr...

62. TylerE ◴[] No.22744074{5}[source]
Similar to how, say, Go and Rust became popular while nim and D have largely remained niche products

PS: Hi, Steve!

replies(1): >>22814445 #
63. TylerE ◴[] No.22745881{5}[source]
https://en.wikipedia.org/wiki/Great_man_theory
64. wainstead ◴[] No.22814445{6}[source]
Hi Tyler! :)