
212 points by CrankyBear | 1 comment
altcognito No.45112060
Can I ask a stupid question? Why is this so much worse than what they were doing to gather articles for traditional search engines? I assume that they are gathering pretty much the same data? It is the same articles, no?

— I just realized these are requests the LLM makes on behalf of the client. I can see how this is problematic, but it does seem like there should be a way to cache that.
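
For what it's worth, here is a minimal sketch of what such a cache could look like: a fetch layer shared by the LLM's outbound requests that revalidates stale entries with standard HTTP conditional requests (ETag / If-None-Match) instead of re-downloading the full page. CachingFetcher, the TTL, and everything else here are made up for illustration, not any vendor's actual API:

    # Minimal sketch of a shared fetch cache for LLM-initiated page loads.
    # Uses standard HTTP conditional requests; all names are illustrative.
    import time
    import requests

    class CachingFetcher:
        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self.cache = {}  # url -> (fetched_at, etag, body)

        def fetch(self, url):
            entry = self.cache.get(url)
            if entry:
                fetched_at, etag, body = entry
                # Serve from cache while the entry is still fresh.
                if time.time() - fetched_at < self.ttl:
                    return body
                # Stale: revalidate cheaply instead of re-downloading.
                headers = {"If-None-Match": etag} if etag else {}
                resp = requests.get(url, headers=headers, timeout=10)
                if resp.status_code == 304:  # unchanged on the server
                    self.cache[url] = (time.time(), etag, body)
                    return body
            else:
                resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            body = resp.text
            self.cache[url] = (time.time(), resp.headers.get("ETag"), body)
            return body

The point of the 304 path is that revalidation costs a handful of bytes rather than the whole page, which is exactly the "cache that" idea — though it only helps for sites that actually send ETag or Last-Modified headers.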

replies(2): >>45112386 >>45138365
1. zipy124 No.45138365
There are many factors, but the largest is that there weren't many search companies, and they weren't that well capitalised. That meant there wasn't much competition over the "freshness" of results. There are many, many AI companies, and even more AI data companies supplying data to the ones doing the actual training.

Finally, search engines don't actually cache all the text; they compute something more like embeddings/keywords, plus link-only signals such as PageRank. AI companies, however, want ALL the text/image/video data, and storing all of it is too expensive. Re-downloading it every time you need it is cheap, though, since data ingress is usually free while data egress is not.
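
A toy illustration of that difference: the search-engine side keeps only derived data (an inverted index plus a link graph it runs PageRank over) and can throw the raw text away, while a training corpus needs every raw byte. The documents, links, and the 0.85 damping factor below are just the textbook setup, not any real crawler's pipeline:

    # Toy contrast: a search index stores derived data (tokens + links),
    # not the raw pages themselves. All data here is made up.
    from collections import defaultdict

    docs = {
        "a.html": "cheap crawling is fine for search",
        "b.html": "training data wants every raw byte",
    }
    links = {"a.html": ["b.html"], "b.html": ["a.html"]}

    # Search-engine style: keep tokens and links, discard the raw text.
    inverted = defaultdict(set)
    for url, text in docs.items():
        for token in text.split():
            inverted[token].add(url)

    # Power-iteration PageRank over the link graph (damping factor 0.85):
    # PR(u) = (1 - d)/N + d * sum over v linking to u of PR(v)/outdegree(v)
    ranks = {u: 1.0 / len(docs) for u in docs}
    for _ in range(20):
        new = {}
        for u in docs:
            inbound = sum(ranks[v] / len(links[v]) for v in docs if u in links[v])
            new[u] = 0.15 / len(docs) + 0.85 * inbound
        ranks = new

    print(inverted["crawling"])  # {'a.html'}
    print(ranks)                 # roughly equal here, since the graph is symmetric

The index and link graph are a tiny fraction of the size of the pages they summarise; a training corpus has no such compression trick, which is why re-fetching beats storing for the AI crawlers.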