>Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information.
>The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all. You'd think it's random articles but it's not, it's weird data dumps, ad spam and SEO, terabytes of stock ticker updates, etc. And then there are diamonds mixed in there, the challenge is pick them out.
https://x.com/karpathy/status/1797313173449764933
Context: FineWeb-Edu, which used Llama 3 70B to annotate a sample of pages, trained a classifier on those annotations, and used that classifier to filter FineWeb for educational quality, rejecting >90% of pages.
https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb...
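For anyone wondering what "label a sample with a big LLM, train a cheap classifier, filter everything else" looks like mechanically, here is a toy sketch. Everything in it is an assumption for illustration: the `labeled` scores are made up rather than real LLM output, the word-averaging "classifier" is a stand-in and not the model FineWeb-Edu trained, and the 0-5 scale with a cutoff of 3 is just a plausible setting.

```python
from collections import defaultdict

# Hypothetical sample: (page text, educational score 0-5 as an LLM might assign it).
labeled = [
    ("photosynthesis converts light energy into chemical energy in plants", 5),
    ("the derivative measures how a function changes as its input changes", 4),
    ("buy cheap watches now best price click here", 0),
    ("stock ticker update AAPL 189.2 MSFT 402.1 click here", 0),
]


def tokenize(text):
    return text.lower().split()


def train(samples):
    """Give each word the mean score of the labeled pages it appears in."""
    total, count = defaultdict(float), defaultdict(int)
    for text, score in samples:
        for word in set(tokenize(text)):
            total[word] += score
            count[word] += 1
    return {word: total[word] / count[word] for word in total}


def predict(model, text, default):
    """Predicted score is the mean per-word score; unseen words get `default`."""
    words = tokenize(text)
    if not words:
        return default
    return sum(model.get(w, default) for w in words) / len(words)


def filter_pages(model, pages, default, threshold=3.0):
    """Keep only pages whose predicted score reaches the threshold."""
    return [p for p in pages if predict(model, p, default) >= threshold]


model = train(labeled)
# Unseen words fall back to the mean label, which is below the threshold here.
mean_label = sum(score for _, score in labeled) / len(labeled)

pages = [
    "plants use light energy to grow",
    "click here for the best price on watches",
    "a function changes when its input changes",
]
kept = filter_pages(model, pages, default=mean_label)
print(kept)
```

The point of the two-stage setup is cost: the expensive model only sees the small labeled sample, and the cheap model scores the full crawl.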