
203 points amazonhut | 2 comments
ndai No.45248065
I’m curious where you got your training data. I’ll look myself, but I saw this and thought I’d ask. I have a CPU-first, no-backprop architecture that works very well on classification datasets. It can do single-example incremental updates, which might be useful for continuous learning. I made a toy demo that trains on tiny.txt and predicts next characters, but I’ve never tried to make an LLM before. I think my architecture might work well as an on-device assistant or for on-premises needs, but I want to work with it more before I embarrass myself. Any open-source LLM training datasets you would recommend?
replies(2): >>45248077 #>>45248081 #
1. electroglyph No.45248081
https://huggingface.co/datasets/NousResearch/Hermes-3-Datase...
replies(1): >>45248447 #
2. Snuggly73 No.45248447
To my untrained eye, this looks more like an instruct dataset.

For just plain text, I really like this one - https://huggingface.co/datasets/roneneldan/TinyStories
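
If it helps, here is a minimal sketch (not from the thread) of pulling TinyStories with the Hugging Face `datasets` library and walking it character by character for a next-character demo. The "text" field matches the dataset card; the `model.update` call is a hypothetical stand-in for a single-example incremental update.

    from datasets import load_dataset

    # Stream the dataset so nothing large has to be downloaded up front.
    ds = load_dataset("roneneldan/TinyStories", split="train", streaming=True)

    for i, example in enumerate(ds):
        story = example["text"]
        # Build (context, next-character) pairs for a character-level demo.
        for pos in range(1, len(story)):
            context, target = story[:pos], story[pos]
            # model.update(context, target)  # hypothetical single-example update
        if i >= 2:  # peek at just a few stories in this sketch
            break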