
LLMs can get "brain rot"

(llm-brain-rot.github.io)
466 points by tamnd | 1 comment
andai | No.45659285
I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...
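For anyone who wants to poke at a chunk programmatically rather than by hand, here's a minimal Python sketch. It assumes the requests and warcio packages (pip install requests warcio); the segment path is a placeholder since the link above is truncated, so substitute a real one from the crawl's warc.paths.gz listing.

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Placeholder path: pick a real .warc.gz file from
    # https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/warc.paths.gz
    SEGMENT_URL = ("https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/"
                   "segments/<segment-id>/warc/<file>.warc.gz")

    with requests.get(SEGMENT_URL, stream=True) as resp:
        resp.raise_for_status()
        # warcio auto-detects the per-record gzip framing of .warc.gz files.
        for record in ArchiveIterator(resp.raw):
            # 'response' records hold the crawled HTTP payloads.
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read(500)  # peek at the start
            print(url)
            print(body)
            print("-" * 60)

Streaming the response and reading only the first few hundred bytes per record keeps memory flat even though each file is ~100MB compressed.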

I spotted a large number of things in there that it would be unwise to repeat here. But I assume the data-cleaning process removes such content before pretraining? ;)

Although I have to wonder: I played with some of the base/text Llama models and got very disturbing output from them. So there can't be that much cleaning going on.
