
211 points CrankyBear | 2 comments
idle_zealot ◴[] No.45107787[source]
This is something I have a hard time understanding. What is the point of this aggressive crawling? Gathering training data? Don't we already have massive repos of scraped web data being used for search indexing? Is this a coordination issue, each little AI startup having to scrape its own data because nobody is willing to share their stuff as regular dumps? For Wikipedia we have the official offline downloads, for books we have books3, but there's not an equivalent for the rest of the web? Could this be solved by some system where website operators submit text copies of their sites to a big database? Then in robots.txt or similar add a line that points to that database with a deep link to their site's mirrored content?
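Purely as a sketch of that last idea (nothing like this exists today; the directive name and the registry URL below are made up):

    # standard robots.txt rules
    User-agent: *
    Disallow: /private/
    # hypothetical directive: fetch our content from a shared dump registry
    # instead of crawling the live site
    Content-Mirror: https://webdump.example.org/sites/example.com/latest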

The obvious issues are: a) who would pay to host that database; b) sites not participating because they don't want their content accessible to LLMs for training (so scraping would still provide an advantage over using the database); c) the people implementing these scrapers being unscrupulous and just not bothering to respect sites that direct them to an existing dump of their content; d) strong opponents of AI trying to poison the database with fake submissions...

Or does this proposed database basically already exist between Cloudflare and the Internet Archive, and we already know that the scrapers are some combination of dumb and belligerent and refuse to use anything but the live site?

replies(2): >>45108157 #>>45108865 #
drozycki ◴[] No.45108157[source]
I asked Google AI Mode “does Google ai mode make tens of site requests for a single prompt” and it showed “Looking at 69 sites” before giving a response about query fan-out.

Cloudflare has some large part of the web cached; the Internet Archive takes too long to respond and couldn't handle the load. Google/OpenAI and co. could cache these pages but apparently don't do it aggressively enough, or at all.
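The caching I mean is nothing exotic; a toy sketch (the TTL and in-memory store are placeholders, obviously not what any of these companies actually run):

    import time
    import urllib.request

    _cache = {}   # url -> (fetched_at, body)
    TTL = 3600    # treat anything fetched in the last hour as fresh enough

    def fetch(url):
        now = time.time()
        hit = _cache.get(url)
        if hit and now - hit[0] < TTL:
            return hit[1]            # serve from cache, no request to the origin
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        _cache[url] = (now, body)
        return body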

replies(1): >>45108911 #
ccgreg ◴[] No.45108911[source]
I don't think you're correct about Google. Caching webpages is bread and butter for search engines; that's how they show snippets.
replies(1): >>45109221 #
danudey ◴[] No.45109221[source]
They might cache it, but what if it changed in the last 30 seconds and now their information is out of date? Better make another request just in case.
replies(1): >>45109471 #
ccgreg ◴[] No.45109471[source]
That's not how search engines work. They have a good idea of which pages might be frequently updated. That's how "news search" works, and even small startup search engines like blekko had news search.
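A toy sketch of the revisit-interval idea (just hashing each fetch and backing off when nothing changed; real schedulers use far richer signals than this):

    import hashlib

    class RecrawlState:
        def __init__(self, interval=3600):
            self.interval = interval   # seconds until the next visit
            self.last_hash = None

        def observe(self, body):
            # body: raw response bytes from the latest fetch
            h = hashlib.sha256(body).hexdigest()
            if h == self.last_hash:
                # unchanged: back off, up to a day between visits
                self.interval = min(self.interval * 2, 86400)
            else:
                # changed: revisit sooner, but not more than every 5 minutes
                self.interval = max(self.interval // 2, 300)
            self.last_hash = h
            return self.interval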
replies(1): >>45110357 #
TheServitor ◴[] No.45110357[source]
Indeed. My understanding is that crawling is a real expense at scale, so they optimize for "just enough" to catch most sites' update rhythms and then use other signals (like blog pings, or someone searching for a URL that hasn't been crawled yet, etc.) to selectively chase fresher content.
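Roughly, the signal part could look like this (all names made up; just a priority queue whose entries get pulled forward when a ping or an uncrawled-URL search comes in):

    import heapq
    import time

    class CrawlQueue:
        def __init__(self):
            self._heap = []  # (due_time, url), ordered by when to fetch next

        def schedule(self, url, delay):
            # normal rhythm: revisit after `delay` seconds
            heapq.heappush(self._heap, (time.time() + delay, url))

        def boost(self, url):
            # external signal (ping, someone searched an uncrawled URL, etc.):
            # push a new entry that is due immediately
            heapq.heappush(self._heap, (time.time(), url))

        def next_due(self):
            if self._heap and self._heap[0][0] <= time.time():
                return heapq.heappop(self._heap)[1]
            return None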
replies(1): >>45113233 #
ccgreg ◴[] No.45113233[source]
My experience is that a news crawl is not a big expense at scale, but so far I've only built one and inherited one. BTW, no one uses blog pings; the latest hotness is IndexNow.
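For reference, an IndexNow ping is just an HTTP request; a minimal sketch, assuming the shared api.indexnow.org endpoint and a key file already published at the site root per the protocol:

    import urllib.parse
    import urllib.request

    def ping_indexnow(url, key):
        # Tell participating search engines that `url` changed.
        # Assumes https://<host>/<key>.txt is already being served;
        # the shared endpoint below is the one from the IndexNow docs.
        endpoint = "https://api.indexnow.org/indexnow?" + urllib.parse.urlencode(
            {"url": url, "key": key}
        )
        with urllib.request.urlopen(endpoint) as resp:
            return resp.status  # 200/202 means the ping was accepted

    # ping_indexnow("https://example.com/new-post", "your-indexnow-key")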