
752 points by crazypython | 1 comment
zomglings No.26371706
If anyone from the dolt team is reading this, I'd like to make an enquiry:

At bugout.dev, we have an ongoing crawl of public GitHub. We have just created a dataset of code snippets from popular GitHub repositories, listed by language, license, GitHub repo, and commit hash, and we are looking to release it publicly and keep it up to date with our crawl.

The dataset for a single crawl comes in at about 60 GB. We uploaded the data to Kaggle because we thought it would be a good place for people to work with it. Unfortunately, the Kaggle notebook experience is not tailored to datasets this large. Our dataset is a SQLite database; it takes a long time to load into Kaggle notebooks, and I don't think the notebooks are provisioned with SSDs, as queries are also slow. Our best workaround is to partition it into three datasets on Kaggle (train, eval, and development), but managing that for every update will be a pain, especially as we enrich the dataset with the results of static analysis, etc.

I'd like to explore hosting the public dataset on Dolthub. If this sounds interesting to you, please reach out to me - my email is in my HN profile.
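For concreteness, here is a minimal sketch of querying the SQLite version of the dataset. The snippets table name and the commit_hash column are mentioned later in this thread; the language column name and the file name are assumptions based on the fields and files described here.

    import sqlite3

    # Open the database read-only. "snippets" is the table name used elsewhere
    # in this thread; "language" is an assumed column name, and the file name
    # is the development version published further down.
    conn = sqlite3.connect("file:snippets-dev.db?mode=ro", uri=True)

    # Count snippets per language.
    query = """
        SELECT language, COUNT(*) AS n
        FROM snippets
        GROUP BY language
        ORDER BY n DESC
        LIMIT 10
    """
    for language, n in conn.execute(query):
        print(language, n)

    conn.close()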

replies(5): >>26371719 #>>26371745 #>>26375077 #>>26383000 #>>26383043 #
StreamBright No.26375077
You have other options too. If I have time, I can try to reduce the size with a columnar format that is designed for this use case (repeated values, static dataset).
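A rough illustration of why a columnar format pays off on data like this (not StreamBright's actual approach): with dictionary encoding, Parquet stores each distinct value of a heavily repeated column once, plus small integer indices. A toy sketch with pyarrow:

    import pyarrow as pa

    # A column of heavily repeated values (e.g. license strings) dictionary-encodes
    # down to a 3-entry dictionary plus integer indices.
    licenses = pa.array(["mit", "apache-2.0", "mit", "mit", "gpl-3.0", "mit"] * 1000)
    encoded = licenses.dictionary_encode()

    print(encoded.type)             # dictionary<values=string, indices=int32, ordered=0>
    print(len(encoded.dictionary))  # 3 distinct values backing 6000 entries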
replies(1): >>26375914 #
zomglings No.26375914
That would be really great. Let me know if there's any way we can help. Maybe we could release a small version of the dataset for testing/benchmarking, and then I could take care of running the final processing on the full dataset?
replies(1): >>26377939 #
StreamBright No.26377939
That would be amazing. I get my internet back tomorrow, and I can play with the dataset to see how much we can optimize.
replies(1): >>26378874 #
zomglings No.26378874
Hi StreamBright - I've just published the development version of the dataset to Kaggle as well: https://www.kaggle.com/simiotic/github-code-snippets-develop...

Compressed, it's 471 MB. Uncompressed, just a little more than 3 GB.

If you want to get in touch with me in a better way than HN comments, here are two good options:

1. My email is in my profile

2. You can direct message me (@zomglings) on the Bugout community Slack: https://join.slack.com/t/bugout-dev/shared_invite/zt-fhepyt8...

Looking forward to collaborating with you. :)

replies(1): >>26385279 #
StreamBright No.26385279
I have messaged you on Slack.

Preliminary tests show a significant reduction in space usage when using Parquet instead of SQLite. This is not unexpected at all; Parquet is much better suited to analytical use.

    2.9G Mar 8 08:37 snippets-dev.db
    427M Mar 8 14:05 test1.parquet

    (venv) snippets-dev sqlite3 -header -csv -readonly snippets-dev.db 'SELECT COUNT() FROM snippets;'
    COUNT()
    4850000

    (venv) snippets-dev python test.py test1.parquet
    4850000

I can share the 3 lines of Python and 1 line of SQL I used to convert the dataset.
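The script itself isn't posted here, but a conversion in roughly that many lines is plausible with pandas; the following is a guess at its shape, not the actual code:

    import sqlite3

    import pandas as pd

    # One SQL statement to read the table, one call to write Parquet
    # (to_parquet uses pyarrow or fastparquet under the hood).
    df = pd.read_sql("SELECT * FROM snippets", sqlite3.connect("snippets-dev.db"))
    df.to_parquet("test1.parquet")

The row-count check above would then be a one-liner too, e.g. len(pd.read_parquet("test1.parquet", columns=["commit_hash"])).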

There was only one weird thing that I encountered: you somehow have a newline character in the commit_hash field for every value.
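If anyone else runs into that, the stray newline is easy to strip in place before converting; a sketch, assuming commit_hash is stored as text:

    import sqlite3

    conn = sqlite3.connect("snippets-dev.db")
    # char(10) is the newline character; trim() removes it from both ends of each value.
    conn.execute("UPDATE snippets SET commit_hash = trim(commit_hash, char(10))")
    conn.commit()
    conn.close()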