This is something I have a hard time understanding. What is the point of this aggressive crawling? Gathering training data? Don't we already have massive repos of scraped web data being used for search indexing? Is this a coordination issue, with each little AI startup having to scrape its own data because nobody is willing to share their stuff as regular dumps? For Wikipedia we have the official offline downloads, for books we have books3, but is there no equivalent for the rest of the web? Could this be solved by some system where website operators submit text copies of their sites to a big database? Then, in robots.txt or similar, add a line that points to that database with a deep link to the site's mirrored content?
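As a rough sketch of that last idea (the "Content-Mirror" directive and the archive URL below are made up, not part of any real robots.txt spec), the extra line might look something like:

  User-agent: *
  Allow: /
  # Hypothetical: point crawlers at a pre-made dump instead of the live site
  Content-Mirror: https://big-text-archive.example/dumps/example.com/latest.tar.gz

A well-behaved crawler could then fetch the dump once and check it for freshness, instead of hammering every page on the live site.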
The obvious issues are: a) who would pay to host that database; b) sites not participating because they don't want their content accessible to LLMs for training (so scraping would still provide an advantage over using the database); c) the people implementing these scrapers being unscrupulous enough that they just won't bother respecting sites that direct them to an existing dump of their content; d) strong opponents of AI trying to poison the database with fake submissions...
Or does this proposed database basically already exist, between Cloudflare and the Internet Archive, and do we already know that the scrapers are some combination of dumb and belligerent and will refuse to use anything but the live site?