Nevertheless, starred. Let’s see what it gives.
How does this compare to something like Pachyderm?
How does it work under the covers? What is a splice and what does it mean when it overlaps? https://github.com/liquidata-inc/dolt/blob/84d9eded517167eb2...
Is it feasible to use Conflict-free Replicated Data Types (CRDT) for this?
For instance, how much of a functional wiki could one assemble from off-the-shelf parts? Edit, display, account management, templating, etc could all be handled with existing libraries in a wide array of programming languages.
The logic around the edit history is likely to contain the plurality if not the majority of the custom code.
Even if it's a joke on yourself, just like, why would you give anyone who hasn't heard of your project the idea that it might be mean?
You wouldn't name your pet Dumbass. Why your pet project?
For instance: stop being such a git - this is not imploring someone not to be stupid, it's saying don't be annoying, awkward etc.
Dolt even has some content in a GitHub wiki, which uses git as a database backend for a web app: https://github.com/liquidata-inc/dolt/wiki
We've got a blog post on the storage system coming out on Wednesday. It's a mashup of a Merkle DAG and a B-tree called a Prolly Tree. It comes from an open source package called Noms (https://github.com/attic-labs/noms).
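As a teaser for what the B-tree/Merkle mashup buys you: chunk boundaries are chosen by hashing the data itself, so the same rows always chunk the same way and a small edit only disturbs a few chunks nearby. A toy sketch of that idea in Python (the window and boundary parameters here are made up for illustration, not Noms's actual ones):

    import hashlib

    def chunk_rows(rows, target_size=4):
        """Split a sorted list of rows into chunks with content-defined boundaries.

        A row closes a chunk when its hash matches a pattern. Because the
        decision depends only on the row's content, inserting one row only
        reshuffles the chunks around it; everything else hashes identically.
        """
        chunks, current = [], []
        for row in rows:
            current.append(row)
            digest = hashlib.sha256(repr(row).encode()).digest()
            # Roughly 1 in target_size rows will close a chunk.
            if digest[0] % target_size == 0:
                chunks.append(tuple(current))
                current = []
        if current:
            chunks.append(tuple(current))
        return chunks

    rows_v1 = [(i, f"name-{i}") for i in range(20)]
    rows_v2 = rows_v1[:10] + [(10.5, "inserted")] + rows_v1[10:]

    # Most chunks are identical between the two versions, so a Merkle tree
    # built over the chunk hashes shares almost all of its nodes.
    print(chunk_rows(rows_v1))
    print(chunk_rows(rows_v2))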
I'm not familiar with CRDT. Will read up on that.
I'll say what I said in February: I started a company with the same premise 9 years ago, during the prime "big data" hype cycle. We burned through a lot of investor money only to realize that there was not a market opportunity to capture. That is, many people thought it was cool - we even did co-sponsored data contests with The Economist - but at the end of the day, we couldn't find anyone with an urgent problem that they were willing to pay to solve.
I wish these folks luck! Perhaps things have changed; we were part of a flock of 5 or 10 similar projects and I'm pretty sure the only one still around today is Kaggle.
The post Tim links here is a very apt description of what Pachyderm does. We're designed for version controlling data pipelines, as well as the data they input and output. Pachyderm's filesystem, pfs, is the component that's most similar to Dolt. Pfs is a filesystem rather than a database, so it tends to be used for bigger data formats like videos, genomics files, and sometimes database dumps. And the main reason people do that is so they can run pipelines on top of those data files.
Under the hood the data structures are actually very similar, though; we use a Merkle tree rather than a DAG, but the overall algorithm is much the same. Dolt, I think, is a great approach to version controlling SQL-style data and access. Noms was a really cool idea that didn't seem to quite find its groove, whereas Dolt seems to have taken the algorithm and made it into more of a tool with practical uses.
For example, if you have a feature store for ML, you want to be able to say "Give me train/test data for these features for the years 2012-2020". That isn't possible with versioned immutable data items. Also, if you don't store diffs of the data - if you store immutable copies - you get explosive growth in data volumes. There are 2 (maybe 3) frameworks I am aware of that allow such time-travel queries: Apache Hudi (Uber) and Databricks Delta. (Apache Iceberg, by Netflix, will have support soon.)
Reference:
https://www.logicalclocks.com/blog/mlops-with-a-feature-stor...
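To make the time-travel bit concrete, this is roughly what it looks like with Databricks Delta from PySpark. A sketch only: it assumes a SparkSession with the Delta Lake package configured, and the table path and event_date column are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("feature-time-travel").getOrCreate()

    # Table state as of a point in time (requires Delta Lake on the cluster).
    features = (
        spark.read.format("delta")
        .option("timestampAsOf", "2019-12-31")
        .load("/data/feature_store/transactions")  # hypothetical path
    )

    # Or pin an exact table version for a reproducible train/test split.
    features_v42 = (
        spark.read.format("delta")
        .option("versionAsOf", 42)
        .load("/data/feature_store/transactions")
    )

    train = features.filter("event_date < '2019-01-01'")   # hypothetical column
    test = features.filter("event_date >= '2019-01-01'")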
UUID,CategoryA,CategoryACount,CategoryB,CategoryBCount,BooleanC,BooleanD...etc
When we ingest a new UUID, we add a column "START_DATE" which is the first date the UUID's metrics were valid. When any of the metric counts changes, we add "END_DATE" to the row and add a new row for that UUID with an updated START_DATE.
It works, but it sucks to analyse because you have to partition the database by the days each row was valid and do your aggregations on those partitions. And it sucks to get a snapshot of how a dataset looked on a particular day. It would be much easier if we could just access the daily diffs, which seems like a task Dolt would accomplish.
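For what it's worth, the snapshot-on-a-day part is at least expressible as one filter over the validity window. A pandas sketch using the columns described above (END_DATE assumed empty/NaT for currently valid rows):

    import pandas as pd

    def snapshot(df, day):
        """Return the rows valid on `day` under the START_DATE/END_DATE scheme."""
        day = pd.Timestamp(day)
        still_open = df["END_DATE"].isna()
        return df[(df["START_DATE"] <= day) & (still_open | (df["END_DATE"] > day))]

    df = pd.DataFrame({
        "UUID": ["a", "a", "b"],
        "CategoryACount": [1, 2, 5],
        "START_DATE": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-15"]),
        "END_DATE": pd.to_datetime(["2020-02-01", None, None]),
    })

    print(snapshot(df, "2020-01-20"))  # a's first row and b's row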
I mean it has a better chance of working than getting the third party to implement versioning on their data feed.
The system is still pretty simple. The main cost is the storage for the blobs in the Dolt repos pushed to DoltHub. We use S3 for that. There is an API that receives pushes and writes any other metadata (user, permissions, etc) into an RDS instance that stores metadata for DoltHub. That instance is also used to cache some critical things. Then it's just a set of web servers and a GraphQL layer sitting on top, serving our React app.
git-lfs lets you store large files on GitHub. With Dolt, we offer a similar utility called git-dolt. Both of these allow you to store large things on GitHub as a reference to another storage system, not the object itself.
https://www.dolthub.com/blog/2020-02-24-data-licenses/ https://www.dolthub.com/blog/2020-02-26-copyrightable-materi...
But there are some limitations for large tables. Diffs only work if you keep the file in the same row order. Plus, you have to import the data into some other tool to do anything with it.
I would assume there's an automated test suite, but also some way of diffing large amounts of input data and visualizing those input additions relative to model classifications?
What are the common tools for this?
There's also litetree, whose slogan is simply "SQLite with branches":
AS OF: https://www.dolthub.com/blog/2020-03-20-querying-historical-...
HISTORY SYSTEM TABLE: https://www.dolthub.com/blog/2020-01-23-access-to-everything...
You can run `dolt q -r csv -q <query>` to output whatever you want to a CSV. We would need to do work to output a hierarchical format.
I'm sure it's possible to build whatever time travel operation you want. We can produce an audit log of every cell in the database pretty quickly.
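For instance, pulling a versioned table into pandas from a script looks roughly like this (a sketch; the "sales" table name is a placeholder, and the invocation is the one quoted above):

    import io
    import subprocess

    import pandas as pd

    # Run the query and capture its CSV output.
    result = subprocess.run(
        ["dolt", "q", "-r", "csv", "-q", "SELECT * FROM sales ORDER BY id"],
        capture_output=True, text=True, check=True,
    )

    df = pd.read_csv(io.StringIO(result.stdout))
    print(df.describe())

    # Per the AS OF blog post linked above, the query can also target a
    # historical revision (a branch or commit) instead of the working set.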
GIN: https://gin.g-node.org/ datalad: https://www.datalad.org/
At the time GIN looked really promising as something potentially simple enough for end users in the lab but with a lot of power behind it. (Unfortunately we never got it deployed due to organizational constraints... but that's a separate story.)
We think over time (like years) we can achieve read performance parity with MySQL or PostgreSQL. Architecturally, we will always be slower on writes than other SQL databases, given the versioned storage engine.
Right now, Dolt is built to be used offline for data sharing. And in that use case, the data and all of its history needs to fit on a single logical storage system. The biggest Dolt repository we have right now is 300GB. It tickles some performance bottlenecks.
In the long run, if we get traction we imagine building "big dolt" which is a distributed version of Dolt, where the network cuts happen at logical points in the Merkle DAG. Thus, you could run an arbitrarily large storage and compute cluster to power it.
https://github.com/datproject/dat
It's ~5 years old and I really wanted it to be huge. Hoping this new project is a success. Especially since I notice I went to high school with one of the founders of Dolt (Hey Tim!)
We had a place where we would put up a dialog that would say “Do It”, or “Cancel”, and we’d give somebody a document and say “Here, edit this and save it”, and they’d get to the point where they’re supposed to choose “Do It”, and they’d look a little miffed, and then hit cancel. And we saw that several times, different people.
[...] And when we saw this [in the video recordings], people looking a little miffed and then hitting cancel instead of do it, we turned up the volume and played it back, and [heard them mutter]: “What’s this ‘Dolt’? I’m no dolt! So I hit cancel.”
https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=...
https://www.ahdictionary.com/word/search.html?q=git
Dolt - A stupid person; a dunce.
Related areas are confidence calibration, active learning, and hard example detection during training. Another approach is to synthesise a new, much smaller dataset that trains a neural net to the same accuracy as the original, larger dataset.
On one hand, startups are as exciting as bank heists. You put together an amazing team, do your homework as best you can, then get killed trying to execute your perfect plan. I'm proud of what we built and we all learned a lot the hard way.
On the down-side, building a company is emotionally, physically, mentally exhausting. It wasn't really a matter of whether I could pay myself fairly; I drew a typical programmer salary along with everyone else.
However, the important detail is that you're spending someone else's money and every penny of it represents someone you respect putting their trust in you, and you feel the weight of that every day.
Ultimately, I don't exactly regret it but I certainly wish that we weren't so convincing that we convinced ourselves of a market opportunity that we couldn't access or didn't exist at all. There was so much heat for "data" in 2011 it really seemed like we just had to show up with an amazing product.
We were wrong.
So, as I understand, as long as the DB can be converted into files, it will work as anything else on Git and GitHub. What am I missing?
https://github.com/attic-labs/noms
(Judging by other comments in this thread, Dolt may be a descendant, partially or completely, of Noms?)
What its release (and the very public BitKeeper spat leading up to its release) did do was bring the idea of distributed VCS to the forefront.
Without git, git's style of branching would likely never have been added to hg, and even though it's been added now, AFAICT hg people don't use it. No idea why. Git people get how much freedom git branches give them, freedom that other VCSes, including hg, don't/didn't.
https://github.com/liquidata-inc/dolt#credits-and-license
(and sorry :( -- yay open source?)
The other VCSes have an intuitive concept of branches, because they are in fact branches.
I liked Mercurial more than Git, but when Bitbucket dropped Mercurial I also switched to Git.
Sibling comment to mine:
> Git branching is not intuitive, because they are not branches but pointers/labels.
Funny, that's exactly why I DO find git branches more intuitive.
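To make the pointer/label point concrete: a local branch is literally a tiny file under .git/refs/heads containing a commit hash (unless git has packed it into .git/packed-refs). A quick read-only Python sketch that lists them for any checkout:

    from pathlib import Path

    repo = Path(".")  # run from the root of any git checkout
    heads = repo / ".git" / "refs" / "heads"

    for ref in sorted(heads.rglob("*")):
        if ref.is_file():
            branch = ref.relative_to(heads)
            print(f"{branch} -> {ref.read_text().strip()}")  # just a commit SHA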
https://github.com/attic-labs/noms/blob/master/doc/intro.md
To answer your question, it is pretty easy to make Noms (or Dolt) into a CRDT by defining a merge function that is deterministic.
We experimented with this in Noms but the result wasn't that satisfying and we didn't take it any further:
https://github.com/attic-labs/noms/blob/master/doc/decent/ab...
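As a sketch of what "a merge function that is deterministic" can mean in practice (illustrative Python, not Noms's actual API): a last-writer-wins value per cell, with ties broken by comparing the values themselves, so every replica converges to the same result no matter the merge order.

    def merge_cell(a, b):
        """Deterministic last-writer-wins merge of (timestamp, value) pairs.

        Ties on timestamp are broken by comparing the values, so the result
        never depends on which replica does the merging.
        """
        return max(a, b)  # tuples compare by timestamp first, then by value

    def merge_rows(row_a, row_b):
        """Merge two versions of a row, column by column."""
        merged = dict(row_a)
        for col, cell in row_b.items():
            merged[col] = merge_cell(merged[col], cell) if col in merged else cell
        return merged

    replica_1 = {"name": (5, "Ada"), "email": (7, "ada@a.example")}
    replica_2 = {"name": (6, "Ada L."), "email": (7, "ada@b.example")}

    # Same result in either order: commutative, associative, idempotent.
    assert merge_rows(replica_1, replica_2) == merge_rows(replica_2, replica_1)
    print(merge_rows(replica_1, replica_2))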
Git takes existing files and allows you to version them.
Git for data would take existing tables or rows and allow you to version them.
A uniform, drop-in, open source way to have a history of rows, merge them, restore them, etc., that works the same way for Postgres, MySQL, or Oracle. And is compatible with migrations.
You can already get a history if you use Bigtable or CouchDB; there's no need for Dolt if it's about using a specific product.
How should you really know this without trying? If you aren't convinced, where should the motivation come from?
Overall, I've seen many of these lately and am waiting for one to really shine. Not because I think it's a grand problem, as I can already version my DDL/DML and code, but I do see some need for it because I have a lot of non-tech people working with data, throwing it left and right and expecting me to clean up after them.
This gets lots of people to look at it. But those people still have to decide at the end of the day whether it is actually useful for them.
I suspect that with increasing cloud adoption, accessing data is getting easier by the day, and I see no real need for a “git for data” tool. Plus, as a data scientist, it lets me keep code and data separate, especially if I’m working with confidential data.
Just because Steph Curry uses Under Armour doesn't mean everyone will too.
Obviously something created by Linus was deemed to be of great value, but what made Git as profound as it is today was the UX for the majority of people who are beginners, which was GitHub, backed by millions of other developers.
And I personally find Noms an incredibly satisfying mental model to work with, so I hope that eventually others will too.
There's a lot of different ways that you could interpret the name Linus chose. That's part of what made it clever.
I started on Mercurial and didn't use Git for years. The moment I switched to Git everything made so much more sense to me. Mercurial seemed like it did magic and wouldn't explain it to you. There were multiple kinds of branches, there were revision numbers, octopus merges were impossible to understand, the whole thing tried to act immutable but effective workflows included history editing for squashing and merging and amending and cherry-picking, which is anything but. Partial commits were a little bit of a mystery to me, and shelves seemed to be their own separate thing.
To me Git was simple in comparison. The working copy was the last state at the end of a long sequence of states. Patches were just the way you represented going from one state to another, rather than canonical, so you wouldn't resolve an octopus merge so much as you would get to your desired state and call it a day. Branches were labels to a particular state. Stashes were labels with an optimized workflow. Reflog was just a list of temporary-ish labels. New commits were built against the index, which you could add or remove to independently of file state. Branches were branches were branches, no matter where the repository was. Disconnecting from upstream was simply a matter of removing a remote.
I know it doesn't match up with other people's experience, but I simply have never been able to see Mercurial as an example of a good tool /despite starting on it/. It's always been easier to use git at whatever level of complexity the problem I'm solving calls for, whether it's saving code or rescuing a totally botched interactive rebase, merge, etc.
Hg back in the day was quite limited and followed the same paradigm as other VCSes (which was enforced on you, by the way).
The transparency of the mechanism enables the user to be more powerful while knowing fewer concepts in total. The power of the system comes from the composition of simple parts.
The first time I used Git I swore I would never use SVN again. It was even popular back then to set up git+svn systems so you could work on your git repo, and push a branch to svn to satisfy your employer.
People associate git with Github (and Gitlab), but it used to be very common to just set up a ssh server that people could push projects on to, my server still has a dozen or so projects on it that I haven't touched in a decade. Github spawned from the popularity of Git in the Ruby community, and the desire to make it a little more accessible to people that didn't want to have their own git servers.
I guess "Git for data" is not very useful if you don't have the whole platform built around it to actually use the features. We mainly use it for data synchronization between the nodes and provenance tracking so people can see what data was used to build specific models and to track how the project evolves itself without forcing people to "commit" their changes manually (as we have seen that often data scientists don't even use git, just files on their Jupyter notebooks).
Cool to see another approach at this.
From the first look, I miss the representation of data as plain-old-text-files, but I guess that's a little bit in competition with the goal of getting performance for larger data sets.
Anyway, I am wondering, did somebody here try using plain git like a database to track data in a repository?
[1] https://academy.realm.io/posts/altconf-wil-shipley-git-docum...
https://github.com/paulfitz/daff
and Coopy (distributed spreadsheets with intelligent merges)
Perhaps that is exactly the point. There was a fair amount of hype and press coverage over Git when it was first unveiled. And it was because Linus wrote it, and wrote it in an unexpectedly short time. And it was on the coattails of the whole Bitkeeper saga.
Again, no verifiable source, just water cooler talk with other devs.
Every business has about 100 questions where if you know the answer to those questions, you are quite likely to be successful. The hard part is knowing which questions need to be asked of each business.
To be clear: my co-founders and I were all seasoned, multi-startup people. We had extremely high-calibre investors with finely tuned bullshit meters. In the end, our own pitching skill undid us because I think less-convincing founders would have undergone 25% more scrutiny and that would have required us to demonstrate that we had signed LOIs from 3-5 real customers before starting.
There was a subconscious misdirection around the fact that we didn't actually have anyone beating down our door. We let the excitement for "data", our personalities and our track records carry the moment.
Of course founders have to drink their own Kool-Aid to some extent or they won't make it. But there's real power and value in the customer development mindset. People want this? Prove it.
I don't think they plan to compete in the DB backup storage market. So please don't mislead your potential customers.
Everyone can see how FAANG companies are growing wealthy off the mountains of data they are amassing, so everyone understands how data can be desirable. But what if your potential market base doesn't understand how to "drive" data - how to identify which data would be valuable for them and how best to exploit it? It seems to me that part of a go-to-market strategy needs, at least in the short term, to help potential customers transition from "that's a really shiny bauble" to "I understand how this is going to make me money."
The problem extends beyond medical research due to privacy laws like the GDPR. A participant or user must be able to delete their data, not merely hide it, so as to protect themselves from data breaches. Suggestions welcome.
Inside are a bunch of binary files. It would be interesting to know more about the on-disk layout of the stored tables.
I was not able to find any documentation. Does someone know more about this? Pointers would be appreciated.
This caused a firestorm: some defended him, others defended BitKeeper, and a lot of people said why the hell is Linus using proprietary software to manage an open source project?!?!! Linus waded in and said he'd think about it, I think it was on a Thursday or Friday, and by the next week he had a working prototype of git. [2] The rest is history. BitKeeper faded into irrelevance and git became the lingua franca for open source projects. Arguably its biggest strength was not revision control, but being designed in a manner that many collaborators could seamlessly commit changes for merging. Obviously architected to fulfill the time-consuming requirements of Linus Torvalds, it has stood the test of time. I'm writing this from memory, so if it disagrees with Wikipedia take it with a grain of salt.
[1]: https://en.wikipedia.org/wiki/BitKeeper#Original_license_con... [2]: https://en.wikipedia.org/wiki/Git#History
https://marc.info/?l=git&m=111377572329534
I don't know about 'git branch', but it looks like 'git merge' wasn't a thing
edit: from searching a bit, it appears that it had branches in June of the launch year; dunno if it had those on release.
I'm not sure why you bring it up now. They don't call it "git for data" anywhere that I see, and it's missing 2 of the 3 core features that I think a "git for data" would need to have.
> Dolt is Git for data. Instead of versioning files, Dolt versions tables. DoltHub is a place on the internet to share Dolt repositories. As far as we can tell, Dolt is the only database with branches.
I find it hard to believe no other database has branches, but if that's true and if this product works like you'd imagine, that is really cool.
Given your historical observation, I think you're right that this will not lead to a market revolution, but sometimes you need the right product to change the landscape.
I think that a lot of the data VIP types we met with honestly wanted to know why they needed it, but the more they thought about it, the more it just seemed like a shiny thing.
It's telling that dozens of similar companies with smart people behind them have thrown their talents at this solution, and none of them have located the problem people are eager to pay to solve.
Dat is still around, on version 14, last update in git was 5 days ago.
There will be a new backend released soon that should really improve our ability to transfer and backup versioned data.
Compare and contrast it with the clarity of these introductions:
- https://git-lfs.github.com/ (Git Large File Storage)
- http://paulfitz.github.io/daff/ ("data diff for tables")
first "merge": https://git.kernel.org/pub/scm/git/git.git/commit/?id=33deb6...
first "tag": https://git.kernel.org/pub/scm/git/git.git/commit/?id=bf0c6e...
first "branch": https://git.kernel.org/pub/scm/git/git.git/commit/?id=74b242...
first Linus "branch" commit: https://git.kernel.org/pub/scm/git/git.git/commit/?id=e69a19...
Grant funders are starting to require teams to publish their code and data. Maybe they're the target audience? Data repos vendors could get on the list of approved vendors for teams receiving funding.
Coincidences, accidents, grudges, misunderstandings coupled with path dependencies.
https://www.youtube.com/watch?v=FX7qSwz3SCk (2013) - 'Introducing Dat: If Git Were Designed For Big Data' Talk by the founder.
My point is they pivoted and so maybe this idea won't work, or this was too early.
EDIT: Looking back on _your_ post, I mentioned it because you specifically said "It's a program, and it appears to be an open-source one you can download and use today." And that is what 'dat' is/was. I thought I would mention it.
Darcs 2 (introduced in April 2008) reduces the number of scenarios that will trigger an exponential merge. Repositories created with Darcs 2 should hit fewer exponential merges in practice.
http://darcs.net/FAQ/Performance#is-the-exponential-merge-pr...
BTW, I should have written it above, but dolthub/dolt is quite impressive. I hope you all make it, because it's a great product that I would love to use at work if I eventually shift back over to a data science position (right now, working as a software dev).
The best recommendation we have for that is that user's data should be encrypted with a key that's unique to the user, and when that user asks you to purge their data you should throw away the key. That means that even if two users have the same data it will be stored encrypted by different keys, so if one asks for the data to be purged the other can still keep their data.
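A small sketch of that approach (sometimes called crypto-shredding), using the Python cryptography library; the in-memory dicts here stand in for a real key store and the versioned data store:

    from cryptography.fernet import Fernet

    keys = {}      # per-user keys, kept in a small separately managed key store
    records = {}   # the versioned/immutable store only ever sees ciphertext

    def store(user_id, payload):
        key = keys.setdefault(user_id, Fernet.generate_key())
        records.setdefault(user_id, []).append(Fernet(key).encrypt(payload))

    def read_all(user_id):
        f = Fernet(keys[user_id])
        return [f.decrypt(token) for token in records[user_id]]

    def purge(user_id):
        # Dropping the key makes every historical copy of the ciphertext
        # unreadable, including copies sitting in old commits or backups.
        del keys[user_id]

    store("alice", b"email=alice@example.com")
    print(read_all("alice"))
    purge("alice")
    # read_all("alice") now raises KeyError: the data is effectively deleted.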