
LLMs can get "brain rot"

(llm-brain-rot.github.io)
466 points by tamnd
andai No.45659285
I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...

I spotted a large number of things here that it would be unwise to repeat. But I assume the data-cleaning process removes such content before pretraining? ;)

Although I have to wonder: I played with some of the base/text Llama models and got very disturbing output from them. So there's not that much cleaning going on.
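For anyone who wants to look for themselves: Common Crawl publishes its extracted plain text as gzipped "WET" files (WARC format), and each crawl has a wet.paths.gz index listing the segments, e.g. under crawl-data/CC-MAIN-2025-38/ on data.commoncrawl.org. A minimal parser sketch for a downloaded segment follows; it assumes well-formed records, and the `warcio` library is the robust choice for anything serious:

```python
import gzip
import io

# Segment lists live at e.g.
# https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/wet.paths.gz
# (each listed path is a gzipped WET file of roughly the size mentioned above).

def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WET stream.

    Minimal sketch: assumes well-formed records with a Content-Length
    header; use the `warcio` library for production parsing.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() != b"WARC/1.0":
            continue  # skip blank separator lines between records
        headers = {}
        while True:
            h = stream.readline()
            if not h.strip():
                break  # blank line ends the header block
            key, _, val = h.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = val.strip()
        body = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, body.decode("utf-8", "replace")

def sample_wet_file(path, n=5):
    """Print the target URI and first 200 chars of the first n records
    from a downloaded .warc.wet.gz file."""
    with gzip.open(path, "rb") as f:
        for i, (hdrs, text) in enumerate(iter_wet_records(f)):
            if i >= n:
                break
            print(hdrs.get("WARC-Target-URI", "(no URI)"), text[:200])
```

Eyeballing even the first handful of records this way makes the signal-to-noise ratio of the raw crawl very concrete.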

replies(3): >>45659453 #>>45659477 #>>45661274 #
dist-epoch No.45661274
Karpathy recently made the point that a random Common Crawl sample is mostly junk, that something like a WSJ article is extremely rare in it, and that it's a miracle the models can learn anything at all.
replies(2): >>45662638 #>>45663097 #
jojobas No.45663097
From the current WSJ front page:

Paul Ingrassia's 'Nazi Streak'

Musk Tosses Barbs at NASA Chief After SpaceX Criticism

Travis Kelce Teams Up With Investor for Activist Campaign at Six Flags

A Small North Carolina College Becomes a Magnet for Wealthy Students

Cracker Barrel CEO Explains Short-Lived Logo Change

If that's the benchmark for high quality training material we're in trouble.

replies(2): >>45663639 #>>45664716 #
stocksinsmocks No.45663639
There is very, very little written work that will stand the test of time. Maybe the real bitter lesson is that training-data quality is inversely proportional to scale: the technical capabilities exist but can never be realized.