
LLMs can get "brain rot"

(llm-brain-rot.github.io)
466 points by tamnd | 1 comment
andai | No.45659285
I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...
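For anyone who wants to poke at a chunk programmatically rather than by hand, here's a minimal Python sketch. It assumes the requests and warcio packages (pip install requests warcio); the segment path is a placeholder since the link above is truncated, so substitute a real one from the crawl's warc.paths.gz listing.

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Placeholder path: pick a real .warc.gz file from
    # https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/warc.paths.gz
    SEGMENT_URL = ("https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/"
                   "segments/<segment-id>/warc/<file>.warc.gz")

    with requests.get(SEGMENT_URL, stream=True) as resp:
        resp.raise_for_status()
        # warcio auto-detects the per-record gzip framing of .warc.gz files.
        for record in ArchiveIterator(resp.raw):
            # 'response' records hold the crawled HTTP payloads.
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read(500)  # peek at the start
            print(url)
            print(body)
            print("-" * 60)

Streaming the response and reading only the first few hundred bytes per record keeps memory flat even though each file is ~100MB compressed.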

I spotted a large number of things in there that it would be unwise to repeat here. But I assume the data-cleaning process removes such content before pretraining? ;)

Although I have to wonder: I played with some of the base/text Llama models and got very disturbing output from them. So there can't be that much cleaning going on.
