LLMs can get "brain rot"

(llm-brain-rot.github.io)
466 points tamnd | 3 comments
andai ◴[] No.45659285[source]
I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...

I spotted a large number of things in it that would be unwise to repeat here. But I assume the data cleaning process removes such content before pretraining? ;)

Although I have to wonder. I played with some of the base/text Llama models, and got very disturbing output from them. So there's not that much cleaning going on.
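
For anyone who wants to try this, here is a minimal sketch of pulling a few random documents out of a downloaded chunk. It assumes the chunk is a gzipped WET (extracted text) file whose text records carry `WARC-Type: conversion`; the function names `iter_warc_records` and `sample_texts` and the local file path are hypothetical, and the parser is a simplified stdlib-only reader rather than a full WARC implementation.

```python
import gzip
import random
from typing import BinaryIO, Iterator, List, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[dict, bytes]]:
    """Yield (headers, body) for each record in an uncompressed WARC/WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The body is read by length, so its contents are never parsed as headers.
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def sample_texts(path: str, k: int = 5, seed: int = 0) -> List[str]:
    """Reservoir-sample up to k text bodies from a gzipped WET file."""
    rng = random.Random(seed)
    sample: List[str] = []
    seen = 0
    with gzip.open(path, "rb") as fh:
        for headers, body in iter_warc_records(fh):
            if headers.get("WARC-Type") != "conversion":
                continue  # skip warcinfo and other non-text records
            seen += 1
            text = body.decode("utf-8", "replace")
            if len(sample) < k:
                sample.append(text)
            else:
                j = rng.randrange(seen)
                if j < k:
                    sample[j] = text
    return sample
```

Calling `sample_texts("chunk.warc.wet.gz", k=5)` on a local download (the file name is a placeholder) returns up to five page texts chosen uniformly at random, which is enough to get a feel for the raw data without reading the whole chunk.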

replies(3): >>45659453 #>>45659477 #>>45661274 #
dist-epoch ◴[] No.45661274[source]
Karpathy made a point recently that a random Common Crawl sample is complete junk, that something like a WSJ article is extremely rare in it, and that it's a miracle the models can learn anything at all.
replies(2): >>45662638 #>>45663097 #
jojobas ◴[] No.45663097[source]
From the current WSJ front page:

Paul Ingrassia's 'Nazi Streak'

Musk Tosses Barbs at NASA Chief After SpaceX Criticism

Travis Kelce Teams Up With Investor for Activist Campaign at Six Flags

A Small North Carolina College Becomes a Magnet for Wealthy Students

Cracker Barrel CEO Explains Short-Lived Logo Change

If that's the benchmark for high-quality training material, we're in trouble.

replies(2): >>45663639 #>>45664716 #
1. anigbrowl ◴[] No.45664716[source]
In general I find WSJ articles very well written. It's not their fault if much of today's news is about clowns.
replies(1): >>45665103 #
2. dclowd9901 ◴[] No.45665103[source]
Their editorial department is an embarrassment imo. Sycophancy for conservatism thinly veiled as intellectualism.
replies(1): >>45665597 #
3. anigbrowl ◴[] No.45665597[source]
I also hate their editorial department. I'm just saying that the news articles are well written in a technical sense, not that I like their editorial positions or choice of subject matter.