
LLMs can get "brain rot"

(llm-brain-rot.github.io)
466 points by tamnd | 1 comment
AznHisoka ◴[] No.45656299[source]
Can someone explain this in layman's terms?
replies(4): >>45656501 #>>45657077 #>>45658026 #>>45666082 #
sailingparrot ◴[] No.45657077[source]
train on bad data, get a bad model
replies(1): >>45660161 #
xpe ◴[] No.45660161[source]
> train on bad data, get a bad model

Right: in the context of supervised learning, this statement is a good starting point. After all, how can one build a good supervised model without training it on good examples?

But even in that context, it isn't an incisive framing of the problem. Lots of supervised models are resilient to some kinds of error. A better question, I think, is: what kinds of errors at what prevalence tend to degrade performance and why?
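
To make that concrete, here is a minimal sketch of the kind of experiment that question suggests (my own illustration, not from the paper): flip an increasing fraction of training labels and measure accuracy on a clean test set. The dataset, the model, and the uniform-flip noise model are all arbitrary assumptions.

    # Illustrative only: how does the prevalence of label errors affect a
    # simple supervised classifier? Uniform label flips are an assumption;
    # real-world data errors are rarely this benign.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    for noise in [0.0, 0.1, 0.2, 0.3, 0.4]:
        y_noisy = y_train.copy()
        flip = rng.random(len(y_noisy)) < noise    # corrupt this fraction of labels
        y_noisy[flip] = 1 - y_noisy[flip]          # binary task: flip 0 <-> 1
        model = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
        print(f"flip rate {noise:.1f} -> clean test accuracy {model.score(X_test, y_test):.3f}")

Typically accuracy degrades only mildly at low flip rates and falls off as the noise approaches chance level, which is the "resilient to some kinds of error" point above.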

When it comes to LLMs and how they ingest and process training data, there is a lot more going on than pure supervised learning, so it seems reasonable to me that researchers would want to tease the problem apart.
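
For example, the bulk of pretraining is self-supervised next-token prediction: the "labels" are just the training text shifted by one token, so data errors and label errors are the same thing. A rough sketch of that setup (toy token IDs, random logits standing in for a model; PyTorch assumed):

    # Illustrative only: in next-token prediction the targets come from the
    # data itself, so "bad data" corrupts inputs and labels simultaneously.
    import torch
    import torch.nn.functional as F

    token_ids = torch.tensor([[5, 17, 42, 8, 99, 3]])      # one toy sequence
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # predict the next token

    vocab_size = 128
    logits = torch.randn(1, inputs.shape[1], vocab_size)    # stand-in for model output
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    print(loss.item())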