BryantD:
I can understand why LLM companies might want to crawl those diffs -- it's context. Assuming we've already trained LLMs on all the low-hanging fruit, building a training corpus that incorporates the way a piece of text changes over time probably has some value. This doesn't excuse the behavior, of course.

Back in the day, Google published the sitemap protocol to alleviate some crawling issues. But if I recall correctly, that was more about helping the crawlers find more content, not controlling the impact of the crawlers on websites.

jsheard:
The sitemap protocol does have some features to help avoid unnecessary crawling: you can specify the last time each page was modified (lastmod) and roughly how often it's expected to change in the future (changefreq), so that crawlers can skip re-fetching pages when nothing has meaningfully changed.
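
As a rough illustration of how a polite crawler could use those hints, here's a minimal Python sketch. The sitemap URL and the stored last-crawl timestamp are hypothetical, and real sitemaps may omit lastmod/changefreq entirely, so they can only be treated as hints:

    # Sketch: skip re-fetching sitemap URLs whose <lastmod> predates our last crawl.
    import urllib.request
    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone

    SITEMAP_URL = "https://example.com/sitemap.xml"            # hypothetical site
    LAST_CRAWL = datetime(2024, 12, 1, tzinfo=timezone.utc)     # from our own crawl records
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    with urllib.request.urlopen(SITEMAP_URL) as resp:
        root = ET.fromstring(resp.read())

    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod:
            # lastmod may be a bare date or a full W3C datetime ending in "Z"
            modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
            if modified.tzinfo is None:
                modified = modified.replace(tzinfo=timezone.utc)
            if modified <= LAST_CRAWL:
                continue  # unchanged since our last visit, skip it
        print("fetch:", loc)  # only new or changed pages get re-crawled

The same idea extends to changefreq: a page marked "yearly" doesn't need to be polled daily, which is exactly the kind of load reduction the original complaint is about.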