I think the eng teams behind those were just more competent / more frugal with their processing.
And since there was no AWS equivalent, they had to be better citizens: their IP ranges were well known, so banning them was trivial for the websites being crawled.
Sonnet responded: “Sorry, I have no access.” Then I asked it why and it was flummoxed and confused. I asked why Anthropic did not simply maintain mirrors of Wikipedia in XX different languages and run a cron job every week.
Still no cogent answer. Pathetic. Very much an Anthropic blind spot, to the point of being at least amoral, if not outright immoral.
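For what it's worth, the mirroring idea is about a dozen lines of code. A minimal sketch in Python, assuming the public dump files on dumps.wikimedia.org (the "latest" URL pattern is real; the language list and local path are placeholders), to be run weekly from cron:

    # refresh_wikipedia_mirrors.py - hypothetical weekly mirror refresh.
    # Schedule from cron, e.g.:  0 3 * * 1  python3 refresh_wikipedia_mirrors.py
    import urllib.request
    from pathlib import Path

    LANGS = ["en", "de", "fr", "es", "ja"]         # placeholder subset; Wikipedia has ~300 editions
    MIRROR_ROOT = Path("/data/wikipedia-mirrors")  # hypothetical local path

    for lang in LANGS:
        # "latest" is a stable alias maintained by the Wikimedia dump servers.
        url = (f"https://dumps.wikimedia.org/{lang}wiki/latest/"
               f"{lang}wiki-latest-pages-articles.xml.bz2")
        dest = MIRROR_ROOT / lang / "pages-articles.xml.bz2"
        dest.parent.mkdir(parents=True, exist_ok=True)
        print(f"fetching {url}")
        urllib.request.urlretrieve(url, dest)      # one request per week per wiki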
Do the big AI corporations that have profited greatly from the Wikimedia Foundation give anything back? Or are they just large internet bloodsuckers without ethics?
Dario and Sam et al.: Contribute to the welfare of your own blood donors.
Would be great if they did that and maybe seeded it too.
Even worse when you consider that you can download all of Wikipedia for offline use...
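It really is just one file per wiki. A minimal sketch, assuming the dump server reports Content-Length on a HEAD request (the URL is the real English-Wikipedia dump; nothing else here is load-bearing):

    # One HEAD request tells you how big "all of Wikipedia" is.
    import urllib.request

    url = ("https://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        size_gb = int(resp.headers["Content-Length"]) / 1e9
    print(f"English Wikipedia, every article, compressed: {size_gb:.1f} GB")

On the order of 20 GB compressed, last I checked, i.e., a rounding error next to these companies' training budgets.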
The search engines were also limited in resources, so they were judicious about what they fetched, when, and how often; optimizing their own crawlers saved them money, and it saved the websites money too. Even with a hundred crawlers actively indexing your site, none of them was going to index it more than, say, once a day, and 100 requests in a day wasn't really that much, even back then.
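And that kind of politeness was cheap to implement. A minimal sketch of the idea, not any real crawler's code (the thresholds are illustrative): a per-host ledger that enforces a minimum delay between hits and skips URLs fetched within the last day:

    import time
    from urllib.parse import urlparse

    MIN_HOST_DELAY = 5.0        # seconds between requests to the same host
    RECRAWL_INTERVAL = 86400.0  # revisit a given URL at most once a day

    last_host_hit: dict[str, float] = {}
    last_url_fetch: dict[str, float] = {}

    def should_fetch(url: str) -> bool:
        """Politeness check run before every request."""
        now = time.time()
        if now - last_url_fetch.get(url, 0.0) < RECRAWL_INTERVAL:
            return False  # our copy is fresh enough; skip
        host = urlparse(url).netloc
        if now - last_host_hit.get(host, 0.0) < MIN_HOST_DELAY:
            return False  # back off; don't hammer one host
        last_host_hit[host] = now
        last_url_fetch[url] = now
        return True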
Now, companies are pumping billions of dollars into AI; budgets are effectively infinite, limits are bypassed, and norms are ignored. If a company thinks it can benefit from indexing your site 30 times a minute, it will; and even if it doesn't benefit, there's no reason to stop, because the extra requests cost it nothing. These companies cannot risk being anything other than up-to-date: if users come asking about current events and why Space Force is moving to Alabama, and your AI doesn't know but someone else's does, then you're behind the times.
So in the interest of maximizing short-term profit above all else - which is the only thing AI companies are doing in any way, shape, or form - they may as well scrape every URL on your site once per second; it costs them nothing, and they don't care if you go bankrupt and shut down.
I'm still learning the landscape of LLMs, but do we expect an LLM to be able to answer that? I didn't think they had meta information about their own operation.