LLMs can get "brain rot"

(llm-brain-rot.github.io)

466 points tamnd | 2 comments | 21 Oct 25 14:24 UTC


andai ◴[21 Oct 25 18:04 UTC] No.45659285[source]▶

I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...

I spotted a large number of things in it that it would be unwise to repeat here. But I assume the data cleaning process removes such content before pretraining? ;)

Although I have to wonder. I played with some of the base/text Llama models, and got very disturbing output from them. So there's not that much cleaning going on.
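For anyone who wants to try the suggestion above, here is a minimal sketch of how one might pull a handful of random pages out of a downloaded chunk. It assumes the chunk is a WET (extracted text) file in gzipped WARC format, which the comment does not actually specify. The function names `iter_wet_records` and `sample_pages` and the file name in the usage note are made up for illustration, and the parser only handles the simple header layout described in the comments.

```python
import gzip
import random


def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WARC/WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        length = int(headers.get("Content-Length", 0))
        body = stream.read(length)
        yield headers, body.decode("utf-8", "replace")


def sample_pages(path, k=5, seed=0):
    """Reservoir-sample k (url, text) pairs from a local .wet.gz file."""
    rng = random.Random(seed)
    sample = []
    seen = 0
    with gzip.open(path, "rb") as f:
        for headers, text in iter_wet_records(f):
            # Assumption: page text lives in "conversion" records, as in WET files.
            if headers.get("WARC-Type") != "conversion":
                continue
            seen += 1
            item = (headers.get("WARC-Target-URI"), text)
            if len(sample) < k:
                sample.append(item)
            else:
                j = rng.randrange(seen)  # keep each page with probability k/seen
                if j < k:
                    sample[j] = item
    return sample
```

With a chunk saved locally as, say, `chunk.wet.gz` (a hypothetical file name), something like `for url, text in sample_pages("chunk.wet.gz", k=3): print(url, text[:300])` would print the start of three random pages. Reservoir sampling is used so the whole file never has to sit in memory.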

replies(3): >>45659453 #>>45659477 #>>45661274 #

dist-epoch ◴[21 Oct 25 20:29 UTC] No.45661274[source]▶

>>45659285 #

Karpathy made a point recently that a random Common Crawl sample is complete junk, that something like a WSJ article is extremely rare in it, and that it's a miracle the models can learn anything at all.

replies(2): >>45662638 #>>45663097 #

1. andai ◴[21 Oct 25 22:39 UTC] No.45662638[source]▶

>>45661274 #

>Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information.

>The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all. You'd think it's random articles but it's not, it's weird data dumps, ad spam and SEO, terabytes of stock ticker updates, etc. And then there are diamonds mixed in there, the challenge is pick them out.

https://x.com/karpathy/status/1797313173449764933

Context: FineWeb-Edu, which used Llama 70B to [train a classifier to] filter FineWeb for quality, rejecting >90% of pages.

https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb...
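The filtering step mentioned above boils down to scoring each page and keeping only those above a cutoff. A rough sketch of that shape follows. It is not the FineWeb-Edu code: `filter_by_quality` and `toy_score` are hypothetical names, the 0 to 5 scale and the default threshold of 3 are assumptions, and the keyword-based scorer is only a stand-in for a trained classifier.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    pages: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 3.0,  # assumed cutoff on an assumed 0-5 scale
) -> Iterator[str]:
    """Yield only the pages whose quality score reaches the threshold."""
    for page in pages:
        if score(page) >= threshold:
            yield page


def toy_score(page: str) -> float:
    """Hypothetical stand-in for a trained classifier: rewards explanatory words."""
    words = [w.strip(".,").lower() for w in page.split()]
    if not words:
        return 0.0
    hits = sum(w in {"because", "therefore", "example", "explains"} for w in words)
    return min(5.0, 50.0 * hits / len(words))


pages = [
    "BUY NOW cheap tickers AAPL 1.2 MSFT 3.4",
    "Photosynthesis works because chlorophyll absorbs light, for example in leaves.",
]
kept = list(filter_by_quality(pages, toy_score))
```

In a real pipeline the `score` callable would be the classifier's prediction for each page, and the threshold controls how much of the crawl is rejected.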

replies(1): >>45665207 #

2. WA ◴[22 Oct 25 05:32 UTC] No.45665207[source]▶

>>45662638 (TP) #

Don't forget the terabytes of torrented ebooks.

https://www.tomshardware.com/tech-industry/artificial-intell...

https://www.classaction.org/news/1.5b-anthropic-settlement-e...
