
LLMs can get "brain rot"

(llm-brain-rot.github.io)
466 points by tamnd | 1 comment
andai No.45659285
I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...
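
If you'd rather skim a file programmatically than in a text editor, here's a minimal sketch (assuming the requests and warcio packages; the segment path below is a placeholder, substitute a real WET file path from the crawl's wet.paths.gz listing):

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Placeholder path: take a real one from wet.paths.gz.
    WET_URL = ("https://data.commoncrawl.org/crawl-data/"
               "CC-MAIN-2025-38/segments/<segment>/wet/<file>.warc.wet.gz")

    resp = requests.get(WET_URL, stream=True)
    resp.raise_for_status()

    # WET "conversion" records hold the extracted page text.
    for i, record in enumerate(ArchiveIterator(resp.raw)):
        if record.rec_type != "conversion":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", "replace")
        print(uri, "::", text[:120].replace("\n", " "))
        if i >= 20:
            break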

I spotted a large number of things in there that it would be unwise to repeat here. But I assume the data-cleaning process removes such content before pretraining? ;)

Although I have to wonder: I played with some of the base/text Llama models and got very disturbing output from them, so there isn't that much cleaning going on.

throwaway314155 No.45659477
> But I assume the data cleaning process removes such content before pretraining? ;)

I didn't check what you're referring to, but yes, the major providers likely have state-of-the-art classifiers for censoring and filtering such content.
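
Mechanically it comes down to something like this toy sketch of classifier-based filtering (assuming scikit-learn; the training examples, labels, and threshold are all made up, and real pipelines use far stronger models trained on real labels):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Made-up examples: 1 = clean/keep, 0 = junk/drop.
    train_texts = [
        "A well-sourced reference article about marine biology.",
        "CLICK HERE cheap pills FREE FREE FREE winner!!!",
    ]
    train_labels = [1, 0]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(train_texts, train_labels)

    def keep(doc_text, threshold=0.9):
        # Drop anything the classifier isn't confident is clean.
        p_clean = clf.predict_proba([doc_text])[0][1]
        return p_clean >= threshold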

And when that doesn't work, they can use RLHF to train the behavior out of the model.
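
The reward-model step underneath that, in the standard pairwise (Bradley-Terry) formulation, looks roughly like the sketch below; this is the textbook loss, not any particular provider's setup, and the scores are placeholders for a real reward model's outputs (assumes PyTorch):

    import torch
    import torch.nn.functional as F

    def reward_model_loss(chosen_scores, rejected_scores):
        # Push the human-preferred completion's score above the
        # rejected one's: -log sigmoid(r_chosen - r_rejected).
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()

    # Placeholder scores; in practice these come from a reward model
    # run over (prompt, completion) pairs labeled by human raters.
    chosen = torch.tensor([1.2, 0.7])
    rejected = torch.tensor([0.3, 0.9])
    loss = reward_model_loss(chosen, rejected)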

You're trying to make a garbage-in/garbage-out claim, but if there's even a tiny moat, it's in the filtering of these datasets and in purchasing licenses for other, larger sources of data that (unlike Common Crawl) _aren't_ freely available to competitors and the open-source movement.

jedimastert No.45667473
> purchasing of licenses to use other larger sources of data

https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...