I can understand why LLM companies might want to crawl those diffs -- it's context. Assuming we've already trained LLMs on all the low-hanging fruit, building a training corpus that captures how a piece of text changes over time probably has some value. That doesn't excuse the behavior, of course.
Back in the day, Google published the sitemap protocol to alleviate some crawling issues. But if I recall correctly, that was more about helping crawlers find content than about controlling their impact on websites.
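A minimal sitemap under the sitemaps.org 0.9 schema looks roughly like this (the URL and values here are just illustrative). Every field is a discovery or freshness hint -- there's nothing in it about request rate or server load:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <!-- hypothetical page; loc/lastmod/changefreq/priority are all hints to crawlers -->
        <loc>https://example.com/some-page</loc>
        <lastmod>2024-01-15</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>

If I remember right, crawl-rate control lived elsewhere: the non-standard Crawl-delay directive in robots.txt (honored by some crawlers, not Google) or per-site settings in the search console.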