
    LLMs can get "brain rot"

    (llm-brain-rot.github.io)
    466 points by tamnd | 12 comments
    1. andai ◴[] No.45659285[source]
    I encourage everyone with even a slight interest in the subject to download a random sample of Common Crawl (the chunks are ~100MB) and see for yourself what is being used for training data.

    https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...

    I spotted a large number of things in there that it would be unwise to repeat here. But I assume the data cleaning process removes such content before pretraining? ;)

    Although I have to wonder: I played with some of the base/text Llama models and got very disturbing output from them. So there's not that much cleaning going on.
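
    If you want to skim a chunk yourself, here's a minimal Python sketch of how that might look. The segment path is a placeholder (real ones come from the crawl's wet.paths.gz listing), and it assumes the requests and warcio packages:

      # Stream a single Common Crawl WET chunk and skim its plain-text records.
      # The segment URL is a placeholder -- substitute any real path from the
      # crawl's wet.paths.gz listing. Assumes: pip install requests warcio
      import requests
      from warcio.archiveiterator import ArchiveIterator

      SEGMENT = ("https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/"
                 "segments/<segment-id>/wet/<chunk>.warc.wet.gz")  # placeholder

      resp = requests.get(SEGMENT, stream=True)
      resp.raise_for_status()
      shown = 0
      for record in ArchiveIterator(resp.raw):
          if record.rec_type != "conversion":  # WET text lives in 'conversion' records
              continue
          print(record.rec_headers.get_header("WARC-Target-URI"))
          print(record.content_stream().read().decode("utf-8", "replace")[:300])
          shown += 1
          if shown >= 20:
              break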

    replies(3): >>45659453 #>>45659477 #>>45661274 #
    2. ◴[] No.45659453[source]
    3. throwaway314155 ◴[] No.45659477[source]
    > But I assume the data cleaning process removes such content before pretraining? ;)

    I didn't check what you're referring to, but yes, the major providers likely have state-of-the-art classifiers for censoring and filtering such content.

    And when that doesn't work, they can use RLHF to keep the behavior from occurring.

    You're trying to make some claim about garbage in/garbage out, but if there's even a tiny moat, it's in the filtering of these datasets and in purchasing licenses for other, larger sources of data that (unlike Common Crawl) _aren't_ freely available for competitors and open-source efforts to use.

    replies(1): >>45667473 #
    4. dist-epoch ◴[] No.45661274[source]
    Karpathy made the point recently that a random Common Crawl sample is complete junk, that something like a WSJ article is extremely rare in it, and that it's a miracle the models can learn anything at all.
    replies(2): >>45662638 #>>45663097 #
    5. andai ◴[] No.45662638[source]
    >Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information.

    >The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all. You'd think it's random articles but it's not, it's weird data dumps, ad spam and SEO, terabytes of stock ticker updates, etc. And then there are diamonds mixed in there, the challenge is picking them out.

    https://x.com/karpathy/status/1797313173449764933

    Context: FineWeb-Edu, which used Llama 3 70B to train a classifier that filters FineWeb for quality, rejecting >90% of pages.

    https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb...
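
    As a rough sketch of what that filtering looks like in practice, assuming the fineweb-edu-classifier checkpoint they published on the Hub and the score-of-3 cutoff described in the blog post:

      # Score pages with the released FineWeb-Edu classifier and keep only
      # the ones rated as educational. The checkpoint name and the 3.0 cutoff
      # (on a 0-5 scale) are assumptions taken from the blog post and the Hub.
      import torch
      from transformers import AutoTokenizer, AutoModelForSequenceClassification

      name = "HuggingFaceFW/fineweb-edu-classifier"
      tok = AutoTokenizer.from_pretrained(name)
      model = AutoModelForSequenceClassification.from_pretrained(name)
      model.eval()

      def edu_score(text: str) -> float:
          batch = tok(text, truncation=True, max_length=512, return_tensors="pt")
          with torch.no_grad():
              return model(**batch).logits.squeeze().item()  # regression head, ~0..5

      pages = [
          "Photosynthesis converts sunlight, water and CO2 into glucose...",
          "CLICK HERE!!! cheap followers, best SEO deal online",
      ]
      kept = [p for p in pages if edu_score(p) >= 3.0]  # most pages fall below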

    replies(1): >>45665207 #
    6. jojobas ◴[] No.45663097[source]
    From the current WSJ front page:

    Paul Ingrassia's 'Nazi Streak'

    Musk Tosses Barbs at NASA Chief After SpaceX Criticism

    Travis Kelce Teams Up With Investor for Activist Campaign at Six Flags

    A Small North Carolina College Becomes a Magnet for Wealthy Students

    Cracker Barrel CEO Explains Short-Lived Logo Change

    If that's the benchmark for high-quality training material, we're in trouble.

    replies(2): >>45663639 #>>45664716 #
    7. stocksinsmocks ◴[] No.45663639{3}[source]
    There is very, very little written work that will stand the test of time. Maybe the real bitter lesson is that training-data quality is inversely proportional to scale: the technical capabilities exist but can never be realized.
    8. anigbrowl ◴[] No.45664716{3}[source]
    In general I find WSJ articles very well written. It's not their fault if much of today's news is about clowns.
    replies(1): >>45665103 #
    9. dclowd9901 ◴[] No.45665103{4}[source]
    Their editorial department is an embarrassment imo. Sycophancy for conservatism thinly veiled as intellectualism.
    replies(1): >>45665597 #
    10. WA ◴[] No.45665207{3}[source]
    Don't forget the terabytes of torrented ebooks.

    https://www.tomshardware.com/tech-industry/artificial-intell...

    https://www.classaction.org/news/1.5b-anthropic-settlement-e...

    11. anigbrowl ◴[] No.45665597{5}[source]
    I also hate their editorial department; I'm just saying that the news articles are well written in a technical sense, not that I like their editorial positions or choice of subject matter.
    12. jedimastert ◴[] No.45667473[source]
    > purchasing of licenses to use other larger sources of data

    https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...