
211 points CrankyBear | 1 comment | source
altcognito ◴[] No.45112060[source]
Can I ask a stupid question? Why is this so much worse than what they were doing to gather articles for traditional search engines? I assume that they are gathering pretty much the same data? It is the same articles, no?

— I just realized these are call-outs made by the LLM on behalf of the client. I can see how this is problematic, but it does seem like there should be a way to cache that.

1. splitbrain ◴[] No.45112386[source]
No, the traffic is not caused by client requests (like when your ChatGPT session does a search and checks some sources). It is caused by training runs. The difference is that AI companies are not storing the data they scrape: they let the model ingest it, then throw it away. When they train the next model, they scrape the entire Internet again. At least that's how I understand it.
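The caching the parent comment wishes for is a long-solved problem for traditional crawlers: HTTP conditional requests let a re-crawl of an unchanged page cost the server only a cheap 304 response. A minimal sketch (not how any particular AI crawler actually works; the cache layout and user-agent string here are made up for illustration):

```python
# Sketch of a polite fetcher that caches pages on disk and revalidates
# with ETag / Last-Modified, so unchanged pages are never re-downloaded.
import hashlib
import json
import urllib.error
import urllib.request
from pathlib import Path

CACHE = Path("crawl_cache")
CACHE.mkdir(exist_ok=True)

def cache_key(url: str) -> Path:
    # Hypothetical naming scheme: one file per URL, keyed by SHA-256.
    return CACHE / hashlib.sha256(url.encode()).hexdigest()

def fetch(url: str) -> bytes:
    key = cache_key(url)
    meta_path = key.with_suffix(".meta")
    headers = {"User-Agent": "example-crawler/0.1"}  # made-up UA
    if key.exists() and meta_path.exists():
        # Send validators from the previous fetch, if the server gave any.
        meta = json.loads(meta_path.read_text())
        if meta.get("etag"):
            headers["If-None-Match"] = meta["etag"]
        if meta.get("last_modified"):
            headers["If-Modified-Since"] = meta["last_modified"]
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            key.write_bytes(body)
            meta_path.write_text(json.dumps({
                "etag": resp.headers.get("ETag"),
                "last_modified": resp.headers.get("Last-Modified"),
            }))
            return body
    except urllib.error.HTTPError as e:
        if e.code == 304:  # Not Modified: serve the cached copy
            return key.read_bytes()
        raise
```

Whether a training pipeline could reuse such a cache across runs is a separate engineering question, but the protocol-level machinery for avoiding redundant fetches already exists.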