1 point by r2d | 2 comments
1. r2d No.45806005
Hi HN! I built a tiny (<200 LOC) utility to make dataset management for machine learning easy.

I train small-ish machine learning models (<500M parameters) for protein generation, where the datasets are much less standardized than ImageNet or The Pile. We train on cloud compute a lot, so we're constantly moving data on and off machines and making permanent changes to the dataset. The dataset elements are also all different sizes: instead of every image being 256 x 256, different proteins have different lengths. picomap is a slightly spruced-up version of some code I wrote last year that stores all your data in a single memory-mapped file. This keeps dataset management simple: just push that one file to your cloud compute. For my use cases, it keeps the GPUs happy and fed, and I normally don't even bother with the standard PyTorch Dataset + DataLoader. Feedback welcome!
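To give a feel for the general idea, here is a minimal stdlib-only sketch of packing variable-length records into one file and reading them back through a memory map. This is not picomap's actual file format or API: the layout (a record count, then an offsets table, then the raw bytes) and the names `write_packed` and `PackedReader` are made up for illustration, and proteins are assumed to be plain ASCII byte strings.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<Q")  # record count as a little-endian uint64


def write_packed(path, records):
    """Write variable-length byte records back-to-back into one file.

    Hypothetical layout: [count][count + 1 offsets][payload bytes].
    """
    offsets = [0]
    for rec in records:
        offsets.append(offsets[-1] + len(rec))
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(records)))
        f.write(struct.pack(f"<{len(offsets)}Q", *offsets))
        for rec in records:
            f.write(rec)


class PackedReader:
    """Random access to the records through a read-only memory map."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (self._count,) = HEADER.unpack_from(self._mm, 0)
        self._index_start = HEADER.size
        # The offsets table holds count + 1 entries of 8 bytes each.
        self._payload_start = self._index_start + 8 * (self._count + 1)

    def __len__(self):
        return self._count

    def __getitem__(self, i):
        if not 0 <= i < self._count:
            raise IndexError(i)
        # Adjacent offsets give the start and end of record i.
        start, end = struct.unpack_from("<2Q", self._mm, self._index_start + 8 * i)
        base = self._payload_start
        return self._mm[base + start : base + end]  # slicing copies to bytes

    def close(self):
        self._mm.close()
        self._file.close()


# Example: three "proteins" of different lengths in a single file.
proteins = [b"MKV", b"MSTNPKPQRK", b"G"]
demo_path = os.path.join(tempfile.mkdtemp(), "proteins.bin")
write_packed(demo_path, proteins)
reader = PackedReader(demo_path)
print(len(reader), reader[1])
reader.close()
```

Because everything lives in one file, moving the dataset to another machine is a single copy, and the OS only pages in the records you actually touch.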

(very inspired by mmap_ninja + the nanogpt data management)