>"“Brain Rot” for LLMs isn’t just a catchy metaphor—it reframes data curation as cognitive hygiene for AI"
A metaphor is exactly what it is: not only do LLMs not possess human cognition, there is certainly no established science holding that they are literally valid subjects for clinical psychological assessment.
How does this stuff get published? This is basically a blog post. One of the worst aspects of the whole AI craze is that it has turned a non-trivial amount of academia into a complete cargo cult joke.
I think it's intended as a catchy warning to people who are dumping every piece of the internet (and synthetic data based on it!) into training that there are repercussions.
- consumer marketing
- politics
- venture fundraising
When any system has a few power law winners, it makes sense to grab attention.
Look at Trump and Musk and now Altman. They figured it out.
MrBeast...
Attention, even if negative, wedges you into the system and everyone's awareness. Your mousey quiet competitors aren't even seen or acknowledged. The attention grabbers suck all the oxygen out of the room and win.
If you go back and look at any victory, was it really better solutions, or was it the fact that better solutions led to more attention?
"Look here" -> build consensus and ignore naysayers -> keep building -> feedback loop -> win
It might not just be a societal algorithm. It might be one of the universe's fundamental greedy optimization algorithms. It might underpin lots of systems, including how we ourselves as individuals think and learn.
Our pain receptors. Our own intellectual interests and hobbies. Children learning on the playground. Ant colonies. Bee swarms. The world is full of signals, and there are mechanisms which focus us on the right stimuli.
The two bits about this paper that I think are worth calling out specifically:
- A reasonable amount of post-training can't save you when your pretraining comes from a bad pipeline; i.e., even if the syntax of the pretraining data is legitimate, the model has learned some bad implicit behavior (thought skipping)
- Trying to classify "bad data" is itself a nontrivial problem. Here the engagement-based heuristic actually proved more reliable than an LLM classification of the content (see the sketch after this list)
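To make that second point concrete, here is a minimal sketch of the two classification strategies being contrasted: a cheap engagement heuristic (flag short posts that went viral as "junk") versus asking an LLM judge to score content quality. The field names, thresholds, and the `judge` callable are all hypothetical illustrations, not the paper's actual pipeline.

```python
# Hypothetical sketch of the two "junk data" classifiers discussed above.
# Thresholds, field names, and the judge() helper are invented for
# illustration; the paper's actual filtering pipeline may differ.

from dataclasses import dataclass

@dataclass
class Post:
    text: str
    likes: int
    reposts: int

def junk_by_engagement(post: Post,
                       max_words: int = 30,
                       min_engagement: int = 500) -> bool:
    """Heuristic: short posts with high engagement are treated as 'junk'.
    Cheap and deterministic, and in this case more reliable than a
    semantic judgment of the content."""
    word_count = len(post.text.split())
    engagement = post.likes + post.reposts
    return word_count <= max_words and engagement >= min_engagement

def junk_by_llm(post: Post, judge) -> bool:
    """Alternative: ask an LLM judge to rate content quality.
    `judge` is a hypothetical callable returning a 0-1 quality score."""
    return judge(post.text) < 0.5

# Usage: filter a corpus before pretraining.
corpus = [
    Post("you wont BELIEVE what happened next", 12_000, 4_000),
    Post("A walkthrough of B-tree rebalancing with worked examples...", 40, 3),
]
clean = [p for p in corpus if not junk_by_engagement(p)]
```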
The idea that LLMs are just trained on a pile of raw Internet is severely outdated. (I'm not sure it was ever fully true, but it's far from that by now.)
Coding is one of the easier datasets to curate, because we have a number of ways to actually (somewhat) assess code quality. (Does it run? Does it come with a set of tests, and do they pass? Does it have stylistic integrity? How many issues get flagged by various analysis tools? Etc.)
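As a rough sketch of what combining those signals might look like, here is a small scorer that checks whether a repo compiles, ships tests that pass, and lints cleanly. The tool choices (pytest, ruff), equal weighting, and threshold are assumptions for illustration, not any known production curation pipeline.

```python
# Sketch of scoring a code sample for dataset curation using the checks
# listed above. Tools and weights are illustrative assumptions.

import subprocess
from pathlib import Path

def run(cmd: list[str], cwd: Path) -> bool:
    """Return True if the command exits cleanly in the given directory."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def score_repo(repo: Path) -> float:
    """Combine a few cheap signals into a 0-1 quality score."""
    signals = {
        "compiles": run(["python", "-m", "compileall", "-q", "."], repo),
        "has_tests": any(repo.rglob("test_*.py")),
        "tests_pass": run(["python", "-m", "pytest", "-q"], repo),
        "lints_clean": run(["ruff", "check", "."], repo),
    }
    return sum(signals.values()) / len(signals)

# Usage: keep only repos above some threshold when building the corpus, e.g.
# keep = [r for r in candidate_repos if score_repo(r) >= 0.75]
```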
> (...) We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
The intent was for you to read my comment at face value. I have a point tangential to the discussion at hand that is additive.
LLMs trained on me (and the Hacker News corpus), not the other way around.