I can understand why LLM companies might want to crawl those diffs -- it's context. Assuming we've already trained LLMs on all the low-hanging fruit, building a training corpus that captures how a piece of text changes over time probably has some value. That doesn't excuse the behavior, of course.
Back in the day, Google published the sitemap protocol to alleviate some crawling issues. But if I recall correctly, that was more about helping crawlers find content than about controlling their impact on websites.
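A minimal sitemap under the sitemaps.org 0.9 schema looks roughly like this (the URL and values here are just illustrative). Every field is a discovery or freshness hint -- there's nothing in it about request rate or server load:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <!-- hypothetical page; loc/lastmod/changefreq/priority are all hints to crawlers -->
        <loc>https://example.com/some-page</loc>
        <lastmod>2024-01-15</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>

If I remember right, crawl-rate control lived elsewhere: the non-standard Crawl-delay directive in robots.txt (honored by some crawlers, not Google) or per-site settings in the search console.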