
LLMs can get "brain rot"

(llm-brain-rot.github.io)
466 points by tamnd | 1 comment
andai No.45659285
I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...
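
If you'd rather skim a file programmatically than in a text editor, here's a minimal sketch (assuming the requests and warcio packages; the segment path below is a placeholder, substitute a real WET file path from the crawl's wet.paths.gz listing):

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Placeholder path: take a real one from wet.paths.gz.
    WET_URL = ("https://data.commoncrawl.org/crawl-data/"
               "CC-MAIN-2025-38/segments/<segment>/wet/<file>.warc.wet.gz")

    resp = requests.get(WET_URL, stream=True)
    resp.raise_for_status()

    # WET "conversion" records hold the extracted page text.
    for i, record in enumerate(ArchiveIterator(resp.raw)):
        if record.rec_type != "conversion":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", "replace")
        print(uri, "::", text[:120].replace("\n", " "))
        if i >= 20:
            break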

I spotted a large number of things in there that it would be unwise to repeat here. But I assume the data-cleaning process removes such content before pretraining? ;)

Although I have to wonder: I played with some of the base/text Llama models and got very disturbing output from them, so there isn't that much cleaning going on.

throwaway314155 No.45659477
> But I assume the data cleaning process removes such content before pretraining? ;)

I didn't check what you're referring to, but yes, the major providers likely have state-of-the-art classifiers for censoring and filtering such content.
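
Mechanically it comes down to something like this toy sketch of classifier-based filtering (assuming scikit-learn; the training examples, labels, and threshold are all made up, and real pipelines use far stronger models trained on real labels):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Made-up examples: 1 = clean/keep, 0 = junk/drop.
    train_texts = [
        "A well-sourced reference article about marine biology.",
        "CLICK HERE cheap pills FREE FREE FREE winner!!!",
    ]
    train_labels = [1, 0]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(train_texts, train_labels)

    def keep(doc_text, threshold=0.9):
        # Drop anything the classifier isn't confident is clean.
        p_clean = clf.predict_proba([doc_text])[0][1]
        return p_clean >= threshold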

And when that doesn't work, they can use RLHF to train the behavior out of the model.
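
The reward-model step underneath that, in the standard pairwise (Bradley-Terry) formulation, looks roughly like the sketch below; this is the textbook loss, not any particular provider's setup, and the scores are placeholders for a real reward model's outputs (assumes PyTorch):

    import torch
    import torch.nn.functional as F

    def reward_model_loss(chosen_scores, rejected_scores):
        # Push the human-preferred completion's score above the
        # rejected one's: -log sigmoid(r_chosen - r_rejected).
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()

    # Placeholder scores; in practice these come from a reward model
    # run over (prompt, completion) pairs labeled by human raters.
    chosen = torch.tensor([1.2, 0.7])
    rejected = torch.tensor([0.3, 0.9])
    loss = reward_model_loss(chosen, rejected)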

You're trying to make a garbage-in/garbage-out claim, but if there's even a tiny moat, it's in the filtering of these datasets and in purchasing licenses for other, larger sources of data that (unlike Common Crawl) _aren't_ freely available to competitors and the open-source movement.

jedimastert No.45667473
> purchasing of licenses to use other larger sources of data

https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...