
211 points CrankyBear | 1 comment | source
altcognito ◴[] No.45112060[source]
Can I ask a stupid question? Why is this so much worse than what they were doing to gather articles for traditional search engines? I assume that they are gathering pretty much the same data? It is the same articles, no?

— I just realized these are call-outs made by the LLM on behalf of the client. I can see how this is problematic, but it does seem like there should be a way to cache that.

1. splitbrain ◴[] No.45112386[source]
No, the traffic is not caused by client requests (like when your ChatGPT session does a search and checks some sources). It is caused by training runs. The difference is that AI companies are not storing the data they scrape: they let the model ingest it, then throw it away. When they train the next model, they scrape the entire Internet again. At least that's how I understand it.
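The caching the parent comment wishes for is a long-solved problem for traditional crawlers: HTTP conditional requests let a re-crawl of an unchanged page cost the server only a cheap 304 response. A minimal sketch (not how any particular AI crawler actually works; the cache layout and user-agent string here are made up for illustration):

```python
# Sketch of a polite fetcher that caches pages on disk and revalidates
# with ETag / Last-Modified, so unchanged pages are never re-downloaded.
import hashlib
import json
import urllib.error
import urllib.request
from pathlib import Path

CACHE = Path("crawl_cache")
CACHE.mkdir(exist_ok=True)

def cache_key(url: str) -> Path:
    # Hypothetical naming scheme: one file per URL, keyed by SHA-256.
    return CACHE / hashlib.sha256(url.encode()).hexdigest()

def fetch(url: str) -> bytes:
    key = cache_key(url)
    meta_path = key.with_suffix(".meta")
    headers = {"User-Agent": "example-crawler/0.1"}  # made-up UA
    if key.exists() and meta_path.exists():
        # Send validators from the previous fetch, if the server gave any.
        meta = json.loads(meta_path.read_text())
        if meta.get("etag"):
            headers["If-None-Match"] = meta["etag"]
        if meta.get("last_modified"):
            headers["If-Modified-Since"] = meta["last_modified"]
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            key.write_bytes(body)
            meta_path.write_text(json.dumps({
                "etag": resp.headers.get("ETag"),
                "last_modified": resp.headers.get("Last-Modified"),
            }))
            return body
    except urllib.error.HTTPError as e:
        if e.code == 304:  # Not Modified: serve the cached copy
            return key.read_bytes()
        raise
```

Whether a training pipeline could reuse such a cache across runs is a separate engineering question, but the protocol-level machinery for avoiding redundant fetches already exists.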