
211 points CrankyBear | 2 comments
idle_zealot ◴[] No.45107787[source]
This is something I have a hard time understanding. What is the point of this aggressive crawling? Gathering training data? Don't we already have massive repos of scraped web data being used for search indexing? Is this a coordination issue, each little AI startup having to scrape its own data because nobody is willing to share their stuff as regular dumps? For Wikipedia we have the official offline downloads, for books we have books3, but there's not an equivalent for the rest of the web? Could this be solved by some system where website operators submit text copies of their sites to a big database? Then in robots.txt or similar add a line that points to that database with a deep link to their site's mirrored content?
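Purely as a sketch of that last idea (nothing like this exists today; the directive name and the registry URL below are made up):

    # standard robots.txt rules
    User-agent: *
    Disallow: /private/
    # hypothetical directive: fetch our content from a shared dump registry
    # instead of crawling the live site
    Content-Mirror: https://webdump.example.org/sites/example.com/latest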

The obvious issues are: a) who would pay to host that database; b) sites not participating because they don't want their content accessible to LLMs for training (so scraping would still provide an advantage over using the database); c) the people implementing these scrapers being unscrupulous and just not bothering to respect sites that direct them to an existing dump of their content; d) strong opponents of AI trying to poison the database with fake submissions...

Or does this proposed database basically already exist between Cloudflare and the Internet Archive, and we already know that the scrapers are some combination of dumb and belligerent and refuse to use anything but the live site?

replies(2): >>45108157 #>>45108865 #
drozycki ◴[] No.45108157[source]
I asked Google AI Mode “does Google ai mode make tens of site requests for a single prompt” and it showed “Looking at 69 sites” before giving a response about query fan-out.

Cloudflare has some large part of the web cached; the Internet Archive takes too long to respond and couldn't handle the load. Google/OpenAI and co. could cache these pages but apparently don't do it aggressively enough, or at all.
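The caching I mean is nothing exotic; a toy sketch (the TTL and in-memory store are placeholders, obviously not what any of these companies actually run):

    import time
    import urllib.request

    _cache = {}   # url -> (fetched_at, body)
    TTL = 3600    # treat anything fetched in the last hour as fresh enough

    def fetch(url):
        now = time.time()
        hit = _cache.get(url)
        if hit and now - hit[0] < TTL:
            return hit[1]            # serve from cache, no request to the origin
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        _cache[url] = (now, body)
        return body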

replies(1): >>45108911 #
ccgreg ◴[] No.45108911[source]
I don't think you're correct about Google. Caching webpages is bread and butter for search engines; that's how they show snippets.
replies(1): >>45109221 #
danudey ◴[] No.45109221[source]
They might cache it, but what if it changed in the last 30 seconds and now their information is out of date? Better make another request just in case.
replies(1): >>45109471 #
ccgreg ◴[] No.45109471[source]
That's not how search engines work. They have a good idea of which pages might be frequently updated. That's how "news search" works, and even small startup search engines like blekko had news search.
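A toy sketch of the revisit-interval idea (just hashing each fetch and backing off when nothing changed; real schedulers use far richer signals than this):

    import hashlib

    class RecrawlState:
        def __init__(self, interval=3600):
            self.interval = interval   # seconds until the next visit
            self.last_hash = None

        def observe(self, body):
            # body: raw response bytes from the latest fetch
            h = hashlib.sha256(body).hexdigest()
            if h == self.last_hash:
                # unchanged: back off, up to a day between visits
                self.interval = min(self.interval * 2, 86400)
            else:
                # changed: revisit sooner, but not more than every 5 minutes
                self.interval = max(self.interval // 2, 300)
            self.last_hash = h
            return self.interval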
replies(1): >>45110357 #
TheServitor ◴[] No.45110357[source]
Indeed. My understanding is that crawling is a real expense at scale, so they optimize for "just enough" to catch most sites' update rhythms and then use other signals (like blog pings, or someone searching for a URL that hasn't been crawled yet, etc.) to selectively chase fresher content.
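Roughly, the signal part could look like this (all names made up; just a priority queue whose entries get pulled forward when a ping or an uncrawled-URL search comes in):

    import heapq
    import time

    class CrawlQueue:
        def __init__(self):
            self._heap = []  # (due_time, url), ordered by when to fetch next

        def schedule(self, url, delay):
            # normal rhythm: revisit after `delay` seconds
            heapq.heappush(self._heap, (time.time() + delay, url))

        def boost(self, url):
            # external signal (ping, someone searched an uncrawled URL, etc.):
            # push a new entry that is due immediately
            heapq.heappush(self._heap, (time.time(), url))

        def next_due(self):
            if self._heap and self._heap[0][0] <= time.time():
                return heapq.heappop(self._heap)[1]
            return None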
replies(1): >>45113233 #
ccgreg ◴[] No.45113233[source]
My experience is that a news crawl is not a big expense at scale, but so far I've only built one and inherited one. BTW, no one uses blog pings; the latest hotness is IndexNow.
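For reference, an IndexNow ping is just an HTTP request; a minimal sketch, assuming the shared api.indexnow.org endpoint and a key file already published at the site root per the protocol:

    import urllib.parse
    import urllib.request

    def ping_indexnow(url, key):
        # Tell participating search engines that `url` changed.
        # Assumes https://<host>/<key>.txt is already being served;
        # the shared endpoint below is the one from the IndexNow docs.
        endpoint = "https://api.indexnow.org/indexnow?" + urllib.parse.urlencode(
            {"url": url, "key": key}
        )
        with urllib.request.urlopen(endpoint) as resp:
            return resp.status  # 200/202 means the ping was accepted

    # ping_indexnow("https://example.com/new-post", "your-indexnow-key")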