
770 points by ta988 | 1 comment
BryantD ◴[] No.42551052[source]
I can understand why LLM companies might want to crawl those diffs -- it's context. Assuming we've already trained LLMs on all the low-hanging fruit, building a training corpus that captures how a piece of text changes over time probably has some value. This doesn't excuse the behavior, of course.

Back in the day, Google published the Sitemaps protocol to alleviate some crawling issues. But if I recall correctly, that was more about helping crawlers find more content than about controlling the impact crawlers have on websites.
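
For reference, a sitemap is just an XML file listing URLs. Here is a minimal sketch of how a crawler could use one to discover pages rather than brute-forcing every link; the example.org URL and the one-second delay are illustrative assumptions, not anything the protocol mandates:

    # Minimal sketch: discover URLs via the Sitemaps protocol (Python stdlib only).
    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.org/sitemap.xml"  # hypothetical site
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}  # standard sitemap namespace

    def sitemap_urls(url):
        """Yield the <loc> entries from a standard sitemap file."""
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        for loc in root.findall(".//sm:loc", NS):
            yield loc.text.strip()

    for page in sitemap_urls(SITEMAP_URL):
        print(page)
        time.sleep(1)  # crude politeness delay; a real crawler would also honor robots.txt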

replies(2): >>42551088 #>>42552029 #
1. herval ◴[] No.42552029[source]
It’s also for the web index they’re all building, I imagine. Lately I’ve been defaulting to web search via ChatGPT instead of Google, simply because Google can’t find anything anymore, while ChatGPT can even find discussions in GitHub issues that are relevant to me. The web is in a very, very weird place.