I think the eng teams behind those were just more competent / more frugal with their processing.
And since there was no AWS equivalent, they had to be better citizens: their IP ranges were well known, so banning them was trivial for the websites being crawled.
Sonnet responded: “Sorry, I have no access.” Then I asked it why and it was flummoxed and confused. I asked why Anthropic did not simply maintain mirrors of Wikipedia in XX different languages and run a cron job every week.
Still no cogent answer. Pathetic. Very much an Anthropic blind spot, to the point of being at least amoral, if not outright immoral.
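For what it's worth, the mirroring idea is about a dozen lines of code. A minimal sketch in Python, assuming the public dump files on dumps.wikimedia.org (the "latest" URL pattern is real; the language list and local path are placeholders), to be run weekly from cron:

    # refresh_wikipedia_mirrors.py - hypothetical weekly mirror refresh.
    # Schedule from cron, e.g.:  0 3 * * 1  python3 refresh_wikipedia_mirrors.py
    import urllib.request
    from pathlib import Path

    LANGS = ["en", "de", "fr", "es", "ja"]         # placeholder subset; Wikipedia has ~300 editions
    MIRROR_ROOT = Path("/data/wikipedia-mirrors")  # hypothetical local path

    for lang in LANGS:
        # "latest" is a stable alias maintained by the Wikimedia dump servers.
        url = (f"https://dumps.wikimedia.org/{lang}wiki/latest/"
               f"{lang}wiki-latest-pages-articles.xml.bz2")
        dest = MIRROR_ROOT / lang / "pages-articles.xml.bz2"
        dest.parent.mkdir(parents=True, exist_ok=True)
        print(f"fetching {url}")
        urllib.request.urlretrieve(url, dest)      # one request per week per wiki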
Do the big AI corporations that have profited greatly from the Wikimedia Foundation give anything back? Or are they just large internet bloodsuckers without ethics?
Dario and Sam et al.: Contribute to the welfare of your own blood donors.
Would be great if they did that and maybe seeded it too.
Even worse when you consider that you can download all of Wikipedia for offline use...
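It really is just one file per wiki. A minimal sketch, assuming the dump server reports Content-Length on a HEAD request (the URL is the real English-Wikipedia dump; nothing else here is load-bearing):

    # One HEAD request tells you how big "all of Wikipedia" is.
    import urllib.request

    url = ("https://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        size_gb = int(resp.headers["Content-Length"]) / 1e9
    print(f"English Wikipedia, every article, compressed: {size_gb:.1f} GB")

On the order of 20 GB compressed, last I checked, i.e., a rounding error next to these companies' training budgets.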
The search engines were also limited in resources, so they were judicious about what they fetched, when, and how often; optimizing their own crawlers saved them money, and it saved the websites money too. Even with a hundred crawlers actively indexing your site, none of them was going to index it more than, say, once a day, and 100 requests in a day wasn't really that much, even back then.
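And that kind of politeness was cheap to implement. A minimal sketch of the idea, not any real crawler's code (the thresholds are illustrative): a per-host ledger that enforces a minimum delay between hits and skips URLs fetched within the last day:

    import time
    from urllib.parse import urlparse

    MIN_HOST_DELAY = 5.0        # seconds between requests to the same host
    RECRAWL_INTERVAL = 86400.0  # revisit a given URL at most once a day

    last_host_hit: dict[str, float] = {}
    last_url_fetch: dict[str, float] = {}

    def should_fetch(url: str) -> bool:
        """Politeness check run before every request."""
        now = time.time()
        if now - last_url_fetch.get(url, 0.0) < RECRAWL_INTERVAL:
            return False  # our copy is fresh enough; skip
        host = urlparse(url).netloc
        if now - last_host_hit.get(host, 0.0) < MIN_HOST_DELAY:
            return False  # back off; don't hammer one host
        last_host_hit[host] = now
        last_url_fetch[url] = now
        return True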
Now, companies are pumping billions of dollars into AI; budgets are effectively infinite, limits are bypassed, and norms are ignored. If a company thinks it can benefit from indexing your site 30 times a minute, it will; and even if it doesn't benefit, there's no reason to stop, because the extra requests cost it nothing. These companies cannot risk being anything other than up-to-date: if users come asking about current events and why Space Force is moving to Alabama, and your AI doesn't know but someone else's does, then you're behind the times.
So in the interest of maximizing short-term profit above all else - which is the only thing AI companies are doing in any way, shape, or form - they may as well scrape every URL on your site once per second; it costs them nothing, and they don't care if you go bankrupt and shut down.
I'm still learning the landscape of LLMs, but do we expect an LLM to be able to answer that? I didn't think they had meta information about their own operation.