Not even close to everything.
E.g., training on the NY Times and Wikipedia involves essentially no meaningful AI-generated content. Training on books from reputable publishers similarly involves essentially none. Any LLM usage there was to polish prose or assist with research, and shouldn't affect the factual quality in any significant way.
The web hasn't been polluted with AI any more than email has been polluted with spam. Which is to say it's there, but it's also entirely viable to filter out. Nobody's worried that the group email chain with friends is being overrun with spam, or with AI.
It's not about a heuristic on text of unknown provenance -- it's about publishers that exert a certain level of editorial control and quality verification. Or social reputation mechanisms that achieve the same.
That's what is preventing your "model collapse": reputations of provenance, not pure-text heuristics.