
LLMs can get "brain rot"

(llm-brain-rot.github.io)
466 points by tamnd
andai No.45659285
I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...

I spotted a large number of things here that it would be unwise to repeat. But I assume the data-cleaning process removes such content before pretraining? ;)

Although I have to wonder: I played with some of the base/text Llama models and got very disturbing output from them. So there's not that much cleaning going on.
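For anyone who wants to look for themselves: Common Crawl publishes its extracted plain text as gzipped "WET" files (WARC format), and each crawl has a wet.paths.gz index listing the segments, e.g. under crawl-data/CC-MAIN-2025-38/ on data.commoncrawl.org. A minimal parser sketch for a downloaded segment follows; it assumes well-formed records, and the `warcio` library is the robust choice for anything serious:

```python
import gzip
import io

# Segment lists live at e.g.
# https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/wet.paths.gz
# (each listed path is a gzipped WET file of roughly the size mentioned above).

def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WET stream.

    Minimal sketch: assumes well-formed records with a Content-Length
    header; use the `warcio` library for production parsing.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() != b"WARC/1.0":
            continue  # skip blank separator lines between records
        headers = {}
        while True:
            h = stream.readline()
            if not h.strip():
                break  # blank line ends the header block
            key, _, val = h.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = val.strip()
        body = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, body.decode("utf-8", "replace")

def sample_wet_file(path, n=5):
    """Print the target URI and first 200 chars of the first n records
    from a downloaded .warc.wet.gz file."""
    with gzip.open(path, "rb") as f:
        for i, (hdrs, text) in enumerate(iter_wet_records(f)):
            if i >= n:
                break
            print(hdrs.get("WARC-Target-URI", "(no URI)"), text[:200])
```

Eyeballing even the first handful of records this way makes the signal-to-noise ratio of the raw crawl very concrete.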

replies(3): >>45659453 #>>45659477 #>>45661274 #
dist-epoch No.45661274
Karpathy recently made the point that a random Common Crawl sample is mostly junk, that something like a WSJ article is extremely rare in it, and that it's a miracle the models can learn anything at all.
replies(2): >>45662638 #>>45663097 #
jojobas No.45663097
From the current WSJ front page:

Paul Ingrassia's 'Nazi Streak'

Musk Tosses Barbs at NASA Chief After SpaceX Criticism

Travis Kelce Teams Up With Investor for Activist Campaign at Six Flags

A Small North Carolina College Becomes a Magnet for Wealthy Students

Cracker Barrel CEO Explains Short-Lived Logo Change

If that's the benchmark for high quality training material we're in trouble.

replies(2): >>45663639 #>>45664716 #
stocksinsmocks No.45663639
There is very, very little written work that will stand the test of time. Maybe the real bitter lesson is that training-data quality is inversely proportional to scale: the technical capabilities exist but can never be realized.