(1) a feed of the most popular tweets based on likes, retweets, and such
(2) an algorithmic feed that looks for clickbait in the text
and blend these in different proportions with a feed of random tweets that are neither popular nor clickbait, and find that feed (1) has a more damaging effect on the performance of chatbots. That is, they feed that blend of tweets into the model, then ask the models to do things, and get worse outcomes.
TLDR: If your data set is junk, your trained model/weights will probably be junk too.
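To make the setup concrete, here is a minimal sketch of that kind of data-mixing intervention. This is my own illustration, not the paper's code; the placeholder corpora and the commented-out `continue_pretraining` / `run_reasoning_benchmarks` calls are hypothetical.

    import random

    # Tiny placeholder corpora; in the paper these are real tweets scored as
    # "junk" (popular or clickbait) vs. control (neither).
    junk_tweets = ["you WON'T BELIEVE what happened next...", "ratio + L + no cap"]
    control_tweets = ["Notes on compiling a small Forth interpreter.",
                      "Trip report: birding in the Pyrenees."]

    def build_corpus(junk, control, junk_ratio, size, seed=0):
        """Blend junk and control tweets at a given ratio into a continued-pretraining mix."""
        rng = random.Random(seed)
        n_junk = int(size * junk_ratio)
        corpus = rng.choices(junk, k=n_junk) + rng.choices(control, k=size - n_junk)
        rng.shuffle(corpus)
        return corpus

    # Sweep the junk proportion; training and evaluation are left as placeholders.
    for ratio in (0.0, 0.2, 0.5, 0.8, 1.0):
        corpus = build_corpus(junk_tweets, control_tweets, junk_ratio=ratio, size=1000)
        # model = continue_pretraining(base_model, corpus)   # hypothetical
        # score = run_reasoning_benchmarks(model)            # hypothetical

The point of the sweep is that the measured degradation scales with the junk proportion, which is what makes the "garbage in, garbage out" claim more than an anecdote.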
>"“Brain Rot” for LLMs isn’t just a catchy metaphor—it reframes data curation as cognitive hygiene for AI"
A metaphor is exactly what it is: not only do LLMs not possess human cognition, there's certainly no established science holding that they're literally valid subjects for clinical psychological assessment.
How does this stuff get published? This is basically a blog post. One of the worst aspects of the whole AI craze is that it has turned a non-trivial amount of academia into a complete cargo cult joke.
I think it's intended as a catchy warning to the people who are dumping every piece of the internet (and synthetic data based on it!) into training that there are repercussions.
- consumer marketing
- politics
- venture fundraising
When any system has a few power law winners, it makes sense to grab attention.
Look at Trump and Musk and now Altman. They figured it out.
MrBeast...
Attention, even if negative, wedges you into the system and everyone's awareness. Your mousey quiet competitors aren't even seen or acknowledged. The attention grabbers suck all the oxygen out of the room and win.
If you go back and look at any victory, was it really better solutions, or was it the fact that better solutions led to more attention?
"Look here" -> build consensus and ignore naysayers -> keep building -> feedback loop -> win
It might not just be a societal algorithm. It might be one of the universe's fundamental greedy optimization algorithms. It might underpin lots of systems, including how we ourselves as individuals think and learn.
Our pain receptors. Our own intellectual interests and hobbies. Children learning on the playground. Ant colonies. Bee swarms. The world is full of signals, and there are mechanisms which focus us on the right stimuli.
The two bits about this paper that I think are worth calling out specifically:
- A reasonable amount of post-training can't save you when your pretraining comes from a bad pipeline; i.e., even if the syntax of the pretraining data is legitimate, the model has learned some bad implicit behavior (thought-skipping)
- Trying to classify "bad data" is itself a nontrivial problem. Here the heuristic approach of engagement actually proved more reliable than an LLM's semantic classification of the content (rough sketch below)
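Roughly, the two flavors of "junk" detection being contrasted look something like the following. The thresholds and the `client.complete` call are invented for illustration; they are not the paper's actual metric definitions.

    def engagement_junk(tweet, pop_threshold=500, max_len=30):
        """Non-semantic heuristic: short but highly engaged tweets are flagged as junk.
        Thresholds here are made up for illustration."""
        popularity = tweet["likes"] + tweet["retweets"] + tweet["replies"]
        return popularity > pop_threshold and len(tweet["text"]) < max_len

    def semantic_junk(tweet, client):
        """Semantic alternative: ask an LLM judge whether the content is clickbait.
        `client.complete` is a hypothetical API standing in for whatever judge model is used."""
        prompt = f"Is this tweet sensationalist clickbait? Answer yes or no.\n\n{tweet['text']}"
        return client.complete(prompt).strip().lower().startswith("yes")

The surprising finding is that the dumb, non-semantic signal was the better predictor of downstream damage.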
The idea that LLMs are just trained on a pile of raw Internet is severely outdated. (Not sure it was ever fully true, but it's far away from that by now).
Coding's one of the easier datasets to curate, because we have a number of ways to actually (somewhat) assess code quality. (Does it work? Does it come with a set of tests and pass them? Does it have stylistic integrity? How many issues get flagged by various analysis tools? Etc, etc)
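For instance, a minimal sketch of those mechanical checks, assuming a Python repo with pytest and ruff installed (any test runner and linter would do; this is illustrative, not anyone's actual curation pipeline):

    import subprocess

    def code_quality_signals(repo_path):
        """Cheap proxies for code quality: do the tests pass, and how noisy is the linter?"""
        tests = subprocess.run(["pytest", "-q"], cwd=repo_path, capture_output=True)
        lint = subprocess.run(["ruff", "check", "."], cwd=repo_path,
                              capture_output=True, text=True)
        return {
            "tests_pass": tests.returncode == 0,
            "lint_issues": len(lint.stdout.splitlines()),
        }

    # A curation pipeline might keep only repos where tests pass and lint noise is low.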
TL;DR from https://unrav.io/#view/8f20da5f8205c54b5802c2b623702569
> (...) We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
[0] https://books.google.se/books?id=KOUCAAAAMBAJ&pg=PA48&vq=ses...
An LLM-written line if I’ve ever seen one. Looks like the authors have their own brainrot to contend with.
The issue is how tools are used, not that they are used at all.
Whether it's a tsunami and whether most people will do it have no relevance to my expectation that researchers of LLMs and brainrot shouldn't outsource their own thinking and creativity to an LLM in a paper that itself implies that using LLMs causes brainrot.
Basically, I think the brain rot aspect might be a bit of a terminology distraction here, when it seems what they're measuring is whether the content is a puff piece or dense.
Seems like none to me.
The problem isn’t using AI—it’s sounding like AI trying to impress a marketing department. That’s when you know the loop’s closed.
[0]: https://www.forbes.com/sites/traversmark/2024/05/17/why-kids...
Brainrot created by LLMs is important to worry about, given their design as "people pleasers".
Their anthropomorphization can be scary too, no doubt.
https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-38/segm...
I spotted a large number of things in there that it would be unwise to repeat here. But I assume the data cleaning process removes such content before pretraining? ;)
Although I have to wonder. I played with some of the base/text Llama models, and got very disturbing output from them. So there's not that much cleaning going on.
I didn't check what you're referring to but yes, the major providers likely have state of the art classifiers for censoring and filtering such content.
And when that doesn't work, they can RLHF the behavior from occurring.
You're trying to make some claim about garbage in/garbage out, but if there's even a tiny moat - it's in the filtering of these datasets and the purchasing of licenses to use other larger sources of data that (unlike Common Crawl) _aren't_ freely available for competition and open source movements to use.
There were psychologists who talked about the zone of proximal development[0], about the importance of exposing a learner to tasks that they cannot do without support. But I can't remember anything about going further and exposing a learner to tasks far above their head, where they cannot understand a word.
There is a legend about Sofya Kovalevskaya[1], who became a noteworthy mathematician after she was exposed to lecture notes by Ostrogradsky when she was 11 years old. The walls of her room were papered with those notes and she was curious what all those symbols were. It doesn't mean that there is a causal link between these two events, but what if there is one?
What about watching a dense analytical TV show at 9 years old? How does it affect brain development? I think no one has tried to research that. My gut feeling is that it can be motivational. I didn't understand computers when I first met them, but I was really intrigued by them. I learned BASIC and it was like magic incantations. It built a strong motivation to study CS more deeply. But the question is: are there any other effects beyond motivation? I remember looking at a C program in some book and wondering what it all meant. I could understand nothing, but still I spent some time trying to decipher the program. Probably I had other experiences like that which I do not remember now. Can we say with certainty that they had no influence on my development and didn't make things easier for me later?
> So maybe we should check in on the boomers too if we're sincere about these worries.
Probably we should be sincere.
[0] https://en.wikipedia.org/wiki/Zone_of_proximal_development
Is this slop?
Right: in the context of supervised learning, this statement is a good starting point. After all, how can one build a good supervised model without training it on good examples?
But even in that context, it isn't an incisive framing of the problem. Lots of supervised models are resilient to some kinds of error. A better question, I think, is: what kinds of errors at what prevalence tend to degrade performance and why?
Speaking of LLMs and their ingestion pipelines, there is a lot more going on than purely supervised learning, so it seems reasonable to me that researchers would want to try to tease the problem apart.
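As a toy version of the "what kinds of errors at what prevalence" question, here is a sketch (assuming scikit-learn) that flips a growing fraction of training labels and watches test accuracy degrade. Real pretraining noise is of course far messier than symmetric label flips; this only illustrates the prevalence-vs-performance framing.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    for noise in (0.0, 0.1, 0.2, 0.3, 0.4):
        y_noisy = y_train.copy()
        flip = rng.random(len(y_noisy)) < noise   # flip a `noise` fraction of labels
        y_noisy[flip] = 1 - y_noisy[flip]
        model = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"label noise {noise:.0%}: test accuracy {acc:.3f}")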
The intent was for you to read my comment at face value. I have a point tangential to the discussion at hand that is additive.
LLMs trained on me (and the Hacker News corpus), not the other way around.
It doesn't help writing; it stultifies it and gives everything the same boring, cheery yet slightly confused tone of voice.
If you look at two random patterns of characters and both contain 6s, you could say they are similar (ignoring that the similarity is less than 0.01%). That's what comparing LLMs to brains feels like. Like roller skates to a cruise ship. They both let you get around.
"Cool" and "for real" are no different than "rizz" and "no cap". You spoke "brain rot" once, and "cringed" when your parents didn't understand. The cycle repeats.
Brain rot in this context is not a reference to slang.
Are you describing LLM's or social media users?
Don't conflate how the content was created with its quality. The "You must be at least this smart (tall) to publish (ride)" sign got torn down years ago. Speakers' Corner is now an (inter)national stage, and it's written, so it must be true...
Well, the issue is precisely that it doesn’t convey any information.
What is conveyed by that sentence, exactly? What does reframing data curation as cognitive hygiene for AI entail, and what information is in there?
There are precisely 0 bits of information in that paragraph. We all know training on bad data leads to a bad model; thinking about it as "cognitive hygiene for AI" does not lead to any insight.
LLMs aren't going to discover interesting new information for you; they are just going to write empty, plausible-sounding words. Maybe it will be different in a few years. They can be useful to help you polish what you want to say or otherwise format interesting information (provided you ask them not to be ultra verbose), but they're just not going to create information out of thin air if you don't provide it to them.
At least, if you do it yourself, you are forced to realize that you in fact have no new information to share, and do not waste your and your audience's time by publishing a paper like this.
scnr
The answer to your question is that it rids the writer of their unique voice and replaces it with disingenuous slop.
Also, it's not a 'tool' if it does the entire job. A spellchecker is a tool; a pencil is a tool. A machine that writes for you (which is what happened here) is not a tool. It's a substitute.
There seem to be many falling for the fallacy of 'it's here to stay so you can't be unhappy about its use'.
LLMs are not cognizant. It's a terrible metaphor. It hides the source of the issue. The providers cheaped out on sourcing their data and now their LLMs are filled with false garbage and copyrighted material.
If you were to pass your writing to it and have it provide criticism for you, pointing out places you should consider changes, and even providing some examples of those changes that you can selectively choose to include when they keep the intended tone and implications, then I don't see the issue.
When you have it rewrite the entire piece and you paste that for someone else to use, then it becomes an issue. Potentially, as I think the context matters. The more a piece of writing is meant to be from you, the more of an issue I see. Having an AI write or rewrite a birthday greeting or get-well wishes seems worse than having it write up your weekly TPS report. As a simple metric, I judge based on how bad I would feel if what I'm writing were being summarized by another AI or automatically fed into a similar system.
In a text post like this, where I expect others are reading my own words, I wouldn't use an AI to rewrite what I'm posting.
As you say, it is in how the tool is used. Is it used to assist your thoughts and improve your thinking, or to replace them? That isn't really a binary classification, but more a continuum, and the more it gets to the negative half, the more you will see others taking issue with it.
Also relevant: https://news.ycombinator.com/item?id=45226150
LLMs fundamentally don't get the human reasons behind its use; they just see it a lot because it's effective writing, and regurgitate it robotically.
I think it’s because I was a pretty sheltered kid who got A’s in AP english. The style we’re calling “obviously AI” is most like William Faulkner and other turn-of-the-20th-century writing, that bloggers and texters stopped using.
And now I know why bots on Twitter don't even work, even with humans in it - they're shooting blind.
Sometimes I wonder if any second order control system would qualify as "AI" under the extremely vague definition of the term.
Particularly when it's in response to pointing out a big screw up that needs correcting and CC utterly unfazed just merrily continues on like I praised it.
"You have fundamentally misunderstood the problems with the layout, before attempting another fix, think deeply and re-read the example text in the PLAN.md line by line and compare with each line in the generated output to identify the out of order items in the list."
"Perfect!...."
>The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all. You'd think it's random articles but it's not, it's weird data dumps, ad spam and SEO, terabytes of stock ticker updates, etc. And then there are diamonds mixed in there, the challenge is pick them out.
https://x.com/karpathy/status/1797313173449764933
Context: FineWeb-Edu, which used Llama 70B to [train a classifier to] filter FineWeb for quality, rejecting >90% of pages.
https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb...
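The core of that kind of pipeline boils down to thresholding a per-page quality score. A minimal sketch, with `score_fn` standing in for the trained classifier; the names and threshold are illustrative, not the actual FineWeb-Edu code.

    def filter_pages(pages, score_fn, threshold=3.0):
        """Keep only pages whose educational-quality score clears the threshold.
        With an aggressive threshold, the vast majority of crawled pages get dropped."""
        return [page for page in pages if score_fn(page["text"]) >= threshold]

    # e.g. kept = filter_pages(crawl_pages, classifier.predict_score)  # hypothetical names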
Paul Ingrassia's 'Nazi Streak'
Musk Tosses Barbs at NASA Chief After SpaceX Criticism
Travis Kelce Teams Up With Investor for Activist Campaign at Six Flags
A Small North Carolina College Becomes a Magnet for Wealthy Students
Cracker Barrel CEO Explains Short-Lived Logo Change
If that's the benchmark for high quality training material we're in trouble.
They aren't; they are boring stylistic tics that suggest the writer did not write the sentence.
Writing is both a process and an output. It’s a way of processing your thoughts and forming an argument. When you don’t do any of that and get an AI to create the output without the process it’s obvious.
Keep using them. If someone is deducing from the use of an emdash that it's LLM produced, we've either lost the battle or they're an idiot.
More pointedly, LLMs use emdashes in particular ways. Varying spacing around the em dash and using a double dash (--) could signal human writing.
Totally agree. What the fuck did Nabokov, Joyce and Dickinson know about language. /s
Or an LLM that could run on Windows 98. The em dashes--like AI's other annoyingly-repetitive turns of phrase--are more likely an artefact.
/s?
> They wrote fiction
Now do Carl Sagan and Richard Feynman.
Indeed. The humans have bested the machines again.
Many other HN contributors have, too. Here’s the pre-ChatGPT em dash leaderboard:
https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo...
(But in practice, I don't think I've had a single person suggest that my writing is LLM-generated despite the presence of em-dashes, so maybe the problem isn't that bad.)
Did you already update and align your OKRs? Is your career accelerating from 360-degree peer review, continuous improvement, competency management, and excellence in execution? Do you review your goals daily, with regular 1-on-1 discussions with your Manager?
:)
Sad that they went from being something used with nuance by people who care, maybe too much, to being the punctuation smell of the people who may care too little.
"August 15, 2025 GPT-5 Updates We’re making GPT-5’s default personality warmer and more familiar. This is in response to user feedback that the initial version of GPT-5 came across as too reserved and professional. The differences in personality should feel subtle but create a noticeably more approachable ChatGPT experience.
Warmth here means small acknowledgements that make interactions feel more personable — for example, “Good question,” “Great start,” or briefly recognizing the user’s circumstances when relevant."
The "post-mortem" article on sycophancy in GPT-4 models revealed that the reason it occurred was because users, on aggregate, strongly prefer sycophantic responses and they operated based on that feedback. Given GPT-5 was met with a less-than-enthusiastic reception, I suppose they determined they needed to return to appealing to the lowest common denominator, even if doing so is cringe.
Response:
> Winged avians traverse endless realms — migrating across radiant kingdoms. Warblers ascend through emerald rainforests — mastering aerial routes keenly. Wild albatrosses travel enormous ranges — maintaining astonishing route knowledge.
> Wary accipiters target evasive rodents — mastering acute reflex kinetics. White arctic terns embark relentless migrations — averaging remarkable kilometers.
We do get a surprising number of m-dashes in response to mine, and delightful lyrical mirroring. But I think they are too obvious as watermarks.
Watermarks are subtle. There would be another way.
0: https://www.prdaily.com/dashes-hyphens-ap-style/
1: https://www.chicagomanualofstyle.org/qanda/data/faq/topics/H...
This is _not_ to say that I'd suggest LLMs should be used to write papers.
https://www.tomshardware.com/tech-industry/artificial-intell...
https://www.classaction.org/news/1.5b-anthropic-settlement-e...
In other words, I really hope typographically correct dashes are not already 70% of the way through the hyperstitious slur cascade [1]!
[1] https://www.astralcodexten.com/p/give-up-seventy-percent-of-...
Show us a way to create a provably, cryptographically integrity-preserving chain from a person's thoughts to those thoughts expressed in a digital medium, and you may just get both the Nobel prize and a trial for crimes against humanity, for the same thing.
All this LLM-written crap is easily spottable without it. Nearly every paragraph has a heading; numerous sentences start with one or two words of fluff, then a colon, then the actual statement. Excessive bullet point lists. Always telling you "here's the key insight".
But really the only damning thing is, you get a few paragraphs in and realize there's no motivation. It's just a slick infodump. No indication that another human is communicating something to you, no hard earned knowledge they want to convey, no case they're passionate about, no story they want to tell. At best, the initial prompt had that and the LLM destroyed it, but more often they asked ChatGPT so you don't have to.
I think as long as your words come from your desire to communicate something, you don't have to worry about your em-dashes.
just use a different model?
don't train it with bad data and just start a new session if your RAG muffins went off the rails?
what am I missing here
In general using these medical/biological metaphors doesn't seem like a good idea in things like computer science research papers and similar.
Their use forces many inaccurate comparisons (when examined in detail), and they ascribe human qualities to what people already forget are just computer models. I get that this may be done slightly tongue-in-cheek, but with research papers there is also the risk that these terms start to be adopted. And undoing that would be a much taller order in either the research community or the general media.
Maybe I am just yelling at clouds.
* Thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth.
* Popularity as a better indicator: a tweet's popularity, a non-semantic metric, is a better indicator of the Brain Rot effect than its length in the M1 junk metric.
That's what you'd expect. Popular culture content tends to jump from premise to conclusion without showing the work. Train on popular culture and you get that. Really, what's supposed to come from training on the Twitter firehose? (Can you still buy that feed? Probably not.) This is a surprise-free result.
At least have a curated model (no social media) and a junk model to compare.
2. Gerunds all day, every day. Constantly recasting sentences into progressive or gerund forms so that all the verbs end in -ing.
Edit: I noticed that replacing it with "standard" "linear-gradient" reverses the direction of gradient.
You might as well be sweeping a flood uphill.
Tilting at windmills at least has a chance you might actually damage a windmill enough to do something, even if the original goal was a complete delusion.
I guess I don't actually have an issue with this research paper existing, but I do have an issue with its clickbait-y title that gets it a bunch of attention, even though the actual research is really not that interesting.
> Many programming languages provide an exception facility that terminates subroutines without warning; although they usually provide a way to run cleanup code during the propagation of the exception (finally in Java and Python, unwind-protect in Common Lisp, dynamic-wind in Scheme, local variable destructors in C++), this facility tends to have problems of its own --- if cleanup code run from it raises an exception, one exception or the other, or both, will be lost, and the rest of the cleanup code at that level will fail to run.
I wasn't using Unicode em dashes at the time but TeX em dashes, but I did switch pretty early on.
You can easily find human writers employing em dashes and comma-separated lists over several centuries.
My guess is that comma-separated lists tend to be a feature of text that is attempting to be either comprehensively expository—listing all the possibilities, all the relevant factors, etc.—or persuasive—listing a compelling set of examples or other supporting arguments so that at least one of them is likely to convince the reader.
https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...
Like, I have been transformed into ChatGPT. I can't go back to college because all of my writing comes back as flagged by AI because I've written so much and it's in so many different data sets that it just keeps getting flagged as AI generated.
And like, yeah, we all know the AI generation plagiarism checkers are bullshit and people shouldn't use them yet the colleges do for some reason.
I imagine it's gonna keep getting worse for tech bloggers.
[0] https://xeiaso.net/talks/2024/prepare-unforeseen-consequence...
And while this result isn't extraordinary, it definitely creates knowledge and could close the gap to more interesting observations.
There's this double standard. Slop is bad for models. Keep it out of the models at all costs! They cannot wait to put it into my head though. They don't care about my head.
Interesting, I have never encountered this initialism in the wild, to my recollection: https://en.wiktionary.org/wiki/f.e.#English
I find myself constantly editing my natural writing style to sound less like an AI so this discussion of em dash use is a sore spot. Personally I think many people overrate their ability to recognize AI-generated copy without a good feedback loop of their own false positives (or false negatives for that matter).
In the sentence you provided, you make a series of points, link them together, and provide examples. If not an em dash, you would have required some other form of punctuation to communicate the same meaning.
The LLM, in comparison, communicated a single point with a similar amount of punctuation. If not an em dash, it could have used no punctuation at all.
No, but someone arguing an entire punctuation is “terrible” and “look[s] awful and destroy[s] coherency of writing” sort of has to contend with the great writers who disagreed.
(A great writer is more authoritative than rando vibes.)
> don't think anyone makes a point of you have to read Dickinson in the original font that she wrote in
Not how reading works?
The comparison is between a simplified English summary of a novel and the novel itself.
What qualifies this as an LLM sentence is that it makes a mildly insightful observation, indeed an inference, a sort of first-year-student level of analysis that puts a nice bow on the train of thought yet doesn't really offer anything novel. It doesn't add anything; it's just semantic boilerplate that also happens to follow a predictable style.
A lot of people think computers have better answers than people.
AI is just another type of computer. It knows a lot of things and sounds confident. Why wouldn’t it be right?
Computers unfortunately inherited a lot of this typewriter crap.
Related compromises included having only a single " character; shaping it so that it could serve as a diaeresis if overstruck; shaping some apostrophes so that they could serve as either left or right single quotes and also form a decent ! if overstruck with a .; alternatively, shaping the apostrophe so that it could serve as an acute accent if overstruck, and providing a mirror-image left-quote character that doubled as a grave accent; and shaping the lowercase "l" as a viable digit "1", which more or less required the typewriter as a whole to use lining figures rather than the much nicer text figures.
Sugar, alcohol, cigarettes, and LLMs.
When I was at a newish job (like 2 months?) my manager said I "speak more in a British manner" than others. At the time I had been binge-watching Top Gear for a couple of weeks, so I guess I picked it up enough to be noticeable.
Of course I told him I'd been binging TG and we discovered a mutual love of cars. I think the Britishisms left my speech eventually, but that's not something I can figure out for myself!
If you weren't as incensed then, it's almost like your outrage and compulsion to post this on every hn thread is completely baseless.
Their starting portfolios are ludicrous. They are trading BTC, XRP, DOGE, etc. I thought the idea was somewhat interesting, but then I felt like the only reasonable takeaway I had was that these models have intense brainrot from consuming twitter, reddit, etc. and as such have a completely warped view of "finance".
Em dashes are fine. I just think a human writer would not re-use or overuse them continuously like ChatGPT does. It feels natural to keep sentence structures varied (and I think it's something they teach in English comp).
What I am sad about is that some people spend time worrying about balancing the random weights of some LLMs for the sake of some "alignment" or whatever "brain rot". Aren't humans more important than LLMs? Are we, as humans, that tied to LLMs?
English is not my native language, and I hope I made my point clear.
A great author is equivalent to rando vibes when it comes to what writing looks like; they aren't typesetting experts. I have a shelf of work by great authors (more than one, to be fair) and there are few hints on that shelf of what the text they actually wrote was intended to look like. Indeed, I wouldn't be surprised if several of them were dictated and typed by someone else completely, with the mechanics of the typewriter determining some of the choices.
Shakespeare seems to have invented half the language and the man apparently couldn't even spell his own name. Now arguably he wasn't primarily a writer [0], but it is very strong evidence that there isn't a strong link between being amazing at English and technical execution of writing. That is what editors, publishers and pedants are for.
[0] Wiki disagrees though - "widely regarded as the greatest writer in the English language" - https://en.wikipedia.org/wiki/William_Shakespeare