
LLMs can get "brain rot"

(llm-brain-rot.github.io)
466 points tamnd | 36 comments
1. pixelmelt ◴[] No.45657074[source]
Isn't this just garbage in, garbage out with an attention-grabbing title?
replies(6): >>45657153 #>>45657205 #>>45657394 #>>45657412 #>>45657896 #>>45658420 #
2. philipallstar ◴[] No.45657153[source]
Attention is all you need.
replies(2): >>45657800 #>>45658232 #
3. wat10000 ◴[] No.45657205[source]
Considering that the current state of the art for LLM training is to feed it massive amounts of garbage (with some good stuff alongside), it seems important to point this out even if it might seem obvious.
replies(1): >>45657247 #
4. CaptainOfCoit ◴[] No.45657247[source]
I don't think anyone is throwing raw datasets into LLMs and hoping for high-quality weights anymore. Nowadays most datasets are filtered one way or another, and some are even highly curated.
replies(1): >>45657546 #
5. Barrin92 ◴[] No.45657394[source]
Yes, I am concerned about the Computer Science profession

>"“Brain Rot” for LLMs isn’t just a catchy metaphor—it reframes data curation as cognitive hygiene for AI"

A metaphor is exactly what it is, because not only do LLMs not possess human cognition, there's certainly no established science of machine thinking under which they're literally valid subjects for clinical psychological assessment.

How does this stuff get published? This is basically a blog post. One of the worst aspects of the whole AI craze is that it has turned a non-trivial amount of academia into a complete cargo-cult joke.

replies(2): >>45657619 #>>45665127 #
6. otterley ◴[] No.45657412[source]
And with extra steps!
replies(1): >>45657847 #
7. BoredPositron ◴[] No.45657546{3}[source]
I doubt they are highly curated; you would need experts in every field to do so. Which gives me more performance anxiety about LLMs, because one of the most curated fields should be code...
replies(3): >>45657692 #>>45657999 #>>45659279 #
8. bpt3 ◴[] No.45657619[source]
It is a blog post; it was published as a GitHub page and on arXiv.

I think it's intended as a catchy warning to the people who are dumping every piece of the internet (and synthetic data based on it!) into training that there are repercussions.

replies(2): >>45657745 #>>45657952 #
9. nradov ◴[] No.45657692{4}[source]
OpenAI has been literally hiring human experts in certain targeted subject areas to write custom proprietary training content.
replies(1): >>45657779 #
10. pluc ◴[] No.45657745{3}[source]
I think it's an interesting line of thought. So we all adopt LLMs and use them everywhere we can. What happens to the next generation of humans, born with AI and with diminished cognitive capacity to even wonder about anything? What about the generation after that? And what happens to the next generation of AI models, which can't train on original, human-created datasets free of AI output?
replies(1): >>45657838 #
11. BoredPositron ◴[] No.45657779{5}[source]
I bet the dataset is mostly comprised of certain areas™.
12. echelon ◴[] No.45657800[source]
In today's hyper-saturated world, attention is everything:

- consumer marketing

- politics

- venture fundraising

When any system has a few power law winners, it makes sense to grab attention.

Look at Trump and Musk and now Altman. They figured it out.

MrBeast...

Attention, even if negative, wedges you into the system and everyone's awareness. Your mousey quiet competitors aren't even seen or acknowledged. The attention grabbers suck all the oxygen out of the room and win.

If you go back and look at any victory, was it really better solutions, or was it the fact that better solutions led to more attention?

"Look here" -> build consensus and ignore naysayers -> keep building -> feedback loop -> win

It might not just be a societal algorithm. It might be one of the universe's fundamental greedy optimization algorithms. It might underpin lots of systems, including how we ourselves as individuals think and learn.

Our pain receptors. Our own intellectual interests and hobbies. Children learning on the playground. Ant colonies. Bee swarms. The world is full of signals, and there are mechanisms which focus us on the right stimuli.

replies(4): >>45658156 #>>45658531 #>>45658557 #>>45660567 #
13. iwontberude ◴[] No.45657838{4}[source]
They will accept that their orders come from a terminal and they will follow them.
replies(1): >>45658384 #
14. Insanity ◴[] No.45657847[source]
Garbage in -> Magic -> Hallucinated Garbage out
15. icyfox ◴[] No.45657896[source]
Yes - garbage in / garbage out still holds true for most things when it comes to LLM training.

The two bits about this paper that I think are worth calling out specifically:

- A reasonable amount of post-training can't save you when your pretraining comes from a bad pipeline; i.e., even if the syntax of the pretraining data is legitimate, the model has learned some bad implicit behavior (thought skipping)

- Trying to classify "bad data" is itself a nontrivial problem. Here the heuristic approach based on engagement actually proved more reliable than an LLM's classification of the content
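
To make that second point concrete, here's a minimal sketch of what an engagement-based junk filter might look like. The field names and thresholds are hypothetical, made up for illustration rather than taken from the paper:

    # Engagement heuristic in the spirit of the paper: short but highly
    # viral posts are treated as likely "brain rot" data. The cutoff
    # values below are invented for the example.
    raw_posts = [
        {"text": "ratio + L + no thoughts",
         "likes": 9000, "replies": 800, "reposts": 4000},
        {"text": "A longer post that actually develops an argument in "
                 "some depth, with context, caveats, and a conclusion.",
         "likes": 12, "replies": 3, "reposts": 1},
    ]

    def is_junk(post):
        engagement = post["likes"] + post["replies"] + post["reposts"]
        short = len(post["text"].split()) < 30   # very little content
        viral = engagement > 500                  # outsized engagement
        return short and viral

    clean = [p for p in raw_posts if not is_junk(p)]
    print(len(clean))  # -> 1: only the substantive post survives

The appeal of the heuristic is exactly that it needs no model in the loop: it keys off metadata that correlates with low-substance content.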

replies(1): >>45659231 #
16. gowld ◴[] No.45657952{3}[source]
arXiv is intended to host research papers, not a blog for researchers.

Letting researchers pollute it with blog-gunk is an abuse of the referral/vetting system for submitters.

17. groby_b ◴[] No.45657999{4}[source]
The major labs are hiring experts. They carefully build & curate synthetic data. The market for labelled non-synthetic data is currently ~$3B/year.

The idea that LLMs are just trained on a pile of raw internet text is severely outdated. (Not sure it was ever fully true, but it's far from that by now.)

Coding's one of the easier datasets to curate, because we have a number of ways to actually (somewhat) assess code quality. (Does it work? Does it come with a set of tests and pass them? Does it have stylistic integrity? How many issues get flagged by various analysis tools? Etc, etc)
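
As a toy illustration of that kind of gate, here's a sketch where only the cheap checks are implemented; a real curation pipeline would sandbox-execute the code, actually run its test suite, and shell out to real linters:

    # Hypothetical quality gate for Python code samples.
    import ast

    def passes_quality_gate(source):
        # "Does it work?" - cheapest possible proxy: does it even parse?
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return False
        # "Does it come with tests?" - here we only check that test
        # functions exist; a real gate would execute them.
        names = [n.name for n in ast.walk(tree)
                 if isinstance(n, ast.FunctionDef)]
        has_tests = any(name.startswith("test_") for name in names)
        # Stylistic integrity / analysis tools would be checked here,
        # e.g. by running a linter and counting flagged issues.
        return has_tests

    sample = ("def add(a, b):\n    return a + b\n\n"
              "def test_add():\n    assert add(1, 2) == 3\n")
    print(passes_quality_gate(sample))  # -> True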

18. peterlk ◴[] No.45658156{3}[source]
You’re absolutely right!
19. dormento ◴[] No.45658232[source]
In case anyone missed the reference: https://arxiv.org/abs/1706.03762

> (...) We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
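
For anyone curious what that mechanism actually computes, here's a toy single-head version of the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, with made-up shapes and none of the multi-head or masking machinery:

    import numpy as np

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)   # how well each query matches each key
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V                # attention-weighted mix of values

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)  # (4, 8)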

20. fragmede ◴[] No.45658384{5}[source]
Manna. https://marshallbrain.com/manna1
21. ashleyn ◴[] No.45658420[source]
Yes, but the idea of chatgpt slowly devolving into Skibidi Toilet and "6 7" references conjures a rather amusing image.
replies(1): >>45659715 #
22. ghurtado ◴[] No.45658531{3}[source]
Something flew approximately 10 miles above your head that would be a good idea for you to learn.
replies(2): >>45659181 #>>45660165 #
23. lawlessone ◴[] No.45658557{3}[source]
Is this copypasted from LinkedIn?
replies(2): >>45658905 #>>45660226 #
24. scubbo ◴[] No.45659181{4}[source]
There were plenty of kinder ways to let someone know that they had missed a reference - https://xkcd.com/1053/
25. satellite2 ◴[] No.45659231[source]
Yes, but the other interesting bit, which is not clearly addressed, is that increasing the garbage in to 100% does not result in absolute garbage out. So evidently there is still something to learn there.
26. satellite2 ◴[] No.45659279{4}[source]
Is that right? Isn't the current way of doing things to throw "everything" at it and then fine-tune?
27. 1121redblackgo ◴[] No.45659715[source]
6-7 ٩(●•)_
replies(1): >>45660350 #
28. echelon ◴[] No.45660165{4}[source]
What makes you think I didn't know the reference? That paper is seminal and essential reading in this space.

The intent was for you to read my comment at face value. I have a point tangential to the discussion at hand that is additive.

29. echelon ◴[] No.45660226{4}[source]
If you traverse back the fourteen years of my comment history (on this account - my other account is older), you'll find that I've always written prose in this form.

LLMs trained on me (and the Hacker News corpus), not the other way around.

30. stavros ◴[] No.45660350{3}[source]
Can someone explain this? I watched a South Park episode that was all about this, but I'm not in the US, so I have no idea what the reference is.
replies(1): >>45661091 #
31. alganet ◴[] No.45660567{3}[source]
You're not accounting for substrate saturation.

If you could just spam and annoy until you win, we'd all be dancing to remixed versions of the Macarena.

32. Sparkle-san ◴[] No.45661091{4}[source]
It's a meme without a lot of real meaning behind it. While it has its origins, I wouldn't say it's a "reference" to anything specific.

https://en.wikipedia.org/wiki/6-7_(meme)

replies(2): >>45661136 #>>45662449 #
33. stavros ◴[] No.45661136{5}[source]
Ahh, thanks, so it's just a thing kids say.
replies(1): >>45661388 #
34. 1121redblackgo ◴[] No.45661388{6}[source]
Yep
35. lexandstuff ◴[] No.45662449{5}[source]
It's a line from a banger Skrilla song, nothing more than that.
36. jll29 ◴[] No.45665127[source]
> How does this stuff get published

"published" only in the sense of "self-published on the Web". This manuscript has not or not yet been passed the peer review process, which is what scientist called "published" (properly).