
    LLMs can get "brain rot"

    (llm-brain-rot.github.io)
    466 points by tamnd | 12 comments
    1. andai ◴[] No.45659285[source]
    I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

    https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...

    I spotted a large number of things in there that it would be unwise to repeat here. But I assume the data cleaning process removes such content before pretraining? ;)

    Although I have to wonder: I played with some of the base/text Llama models and got very disturbing output from them. So there's not that much cleaning going on.
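
    If you want to skim a chunk yourself, here's a minimal Python sketch of how that might look. The segment path is a placeholder (real ones come from the crawl's wet.paths.gz listing), and it assumes the requests and warcio packages:

      # Stream a single Common Crawl WET chunk and skim its plain-text records.
      # The segment URL is a placeholder -- substitute any real path from the
      # crawl's wet.paths.gz listing. Assumes: pip install requests warcio
      import requests
      from warcio.archiveiterator import ArchiveIterator

      SEGMENT = ("https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/"
                 "segments/<segment-id>/wet/<chunk>.warc.wet.gz")  # placeholder

      resp = requests.get(SEGMENT, stream=True)
      resp.raise_for_status()
      shown = 0
      for record in ArchiveIterator(resp.raw):
          if record.rec_type != "conversion":  # WET text lives in 'conversion' records
              continue
          print(record.rec_headers.get_header("WARC-Target-URI"))
          print(record.content_stream().read().decode("utf-8", "replace")[:300])
          shown += 1
          if shown >= 20:
              break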

    replies(3): >>45659453 #>>45659477 #>>45661274 #
    2. ◴[] No.45659453[source]
    3. throwaway314155 ◴[] No.45659477[source]
    > But I assume the data cleaning process removes such content before pretraining? ;)

    I didn't check what you're referring to, but yes, the major providers likely have state-of-the-art classifiers for censoring and filtering such content.

    And when that doesn't work, they can use RLHF to keep the behavior from occurring.

    You're trying to make some claim about garbage in/garbage out, but if there's even a tiny moat, it's in the filtering of these datasets and in purchasing licenses for other, larger sources of data that (unlike Common Crawl) _aren't_ freely available for competitors and open-source efforts to use.

    replies(1): >>45667473 #
    4. dist-epoch ◴[] No.45661274[source]
    Karpathy made the point recently that a random Common Crawl sample is complete junk, that something like a WSJ article is extremely rare in it, and that it's a miracle the models can learn anything at all.
    replies(2): >>45662638 #>>45663097 #
    5. andai ◴[] No.45662638[source]
    >Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information.

    >The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all. You'd think it's random articles but it's not, it's weird data dumps, ad spam and SEO, terabytes of stock ticker updates, etc. And then there are diamonds mixed in there, the challenge is picking them out.

    https://x.com/karpathy/status/1797313173449764933

    Context: FineWeb-Edu, which used Llama 3 70B to train a classifier that filters FineWeb for quality, rejecting >90% of pages.

    https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb...
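
    As a rough sketch of what that filtering looks like in practice, assuming the fineweb-edu-classifier checkpoint they published on the Hub and the score-of-3 cutoff described in the blog post:

      # Score pages with the released FineWeb-Edu classifier and keep only
      # the ones rated as educational. The checkpoint name and the 3.0 cutoff
      # (on a 0-5 scale) are assumptions taken from the blog post and the Hub.
      import torch
      from transformers import AutoTokenizer, AutoModelForSequenceClassification

      name = "HuggingFaceFW/fineweb-edu-classifier"
      tok = AutoTokenizer.from_pretrained(name)
      model = AutoModelForSequenceClassification.from_pretrained(name)
      model.eval()

      def edu_score(text: str) -> float:
          batch = tok(text, truncation=True, max_length=512, return_tensors="pt")
          with torch.no_grad():
              return model(**batch).logits.squeeze().item()  # regression head, ~0..5

      pages = [
          "Photosynthesis converts sunlight, water and CO2 into glucose...",
          "CLICK HERE!!! cheap followers, best SEO deal online",
      ]
      kept = [p for p in pages if edu_score(p) >= 3.0]  # most pages fall below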

    replies(1): >>45665207 #
    6. jojobas ◴[] No.45663097[source]
    From the current WSJ front page:

    Paul Ingrassia's 'Nazi Streak'

    Musk Tosses Barbs at NASA Chief After SpaceX Criticism

    Travis Kelce Teams Up With Investor for Activist Campaign at Six Flags

    A Small North Carolina College Becomes a Magnet for Wealthy Students

    Cracker Barrel CEO Explains Short-Lived Logo Change

    If that's the benchmark for high-quality training material, we're in trouble.

    replies(2): >>45663639 #>>45664716 #
    7. stocksinsmocks ◴[] No.45663639{3}[source]
    There is very, very little written work that will stand the test of time. Maybe the real bitter lesson is that training-data quality is inversely proportional to scale: the technical capabilities exist but can never be realized.
    8. anigbrowl ◴[] No.45664716{3}[source]
    In general I find WSJ articles very well written. It's not their fault if much of today's news is about clowns.
    replies(1): >>45665103 #
    9. dclowd9901 ◴[] No.45665103{4}[source]
    Their editorial department is an embarrassment imo. Sycophancy for conservatism thinly veiled as intellectualism.
    replies(1): >>45665597 #
    10. WA ◴[] No.45665207{3}[source]
    Don't forget the terabytes of torrented ebooks.

    https://www.tomshardware.com/tech-industry/artificial-intell...

    https://www.classaction.org/news/1.5b-anthropic-settlement-e...

    11. anigbrowl ◴[] No.45665597{5}[source]
    I also hate their editorial department; I'm just saying that the news articles are well written in a technical sense, not that I like their editorial positions or choice of subject matter.
    12. jedimastert ◴[] No.45667473[source]
    > purchasing of licenses to use other larger sources of data

    https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...