On the one hand, these companies present themselves as sophisticated, futuristic, and highly valued; on the other, we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.
That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.
To be fair, an accurate measurement would need to consider how many of those CPU cycles would otherwise be spent by the human user who is driving the bot. From that perspective, maybe the scrapers can “make up for it” by crawling efficiently, i.e. not loading tracker scripts, images, etc. unless they're needed to answer the query. They'd still burn CPU cycles, but at least fewer than a human user with a headful browser instance.
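For concreteness, here's a minimal sketch of that kind of “HTML only” fetch in Python. The URL and the bot User-Agent are placeholders, and this only illustrates the idea, not any particular scraper's behavior:

```python
# Fetch just the HTML document; none of the subresources a headful browser
# would pull in (images, tracker scripts, CSS, fonts) are ever requested.
from urllib.request import Request, urlopen

def fetch_html_only(url: str) -> str:
    # Placeholder User-Agent; a well-behaved bot would identify itself here.
    req = Request(url, headers={"User-Agent": "example-bot/1.0 (+https://example.com/bot)"})
    with urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)

html = fetch_html_only("https://example.com/some-article")
```

That single request transfers only the document itself, rather than everything a full browser session would pull down for the same page.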
Gee, if only we had, like, one central archive of the internet. We could even call it the internet archive.
Then all these AI companies could interface directly with that single entity, on terms agreeable to both sides.
However, what matters more to me than AI agents is that someone might want to download single files with curl, use a browser such as Lynx, and so on, and this should still work.
Not even news articles from the top 10 news websites in my country are usually indexed there.
If I couldn't easily cut off the majority of that bot volume, I probably would've shut down the app entirely.
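In case it's useful to anyone: cutting off the self-identifying crawlers can be as simple as matching known User-Agent strings at the edge of the app. A minimal sketch, assuming a Python WSGI app; the substring list is illustrative and far from exhaustive, not my exact setup:

```python
# Illustrative (not exhaustive) list of AI-crawler User-Agent substrings.
BLOCKED_UA_SUBSTRINGS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

def block_ai_crawlers(app):
    """Wrap a WSGI app and reject requests from known crawler User-Agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot in ua for bot in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

The crawlers that announce themselves are the easy part, of course; the ones spoofing a desktop browser User-Agent need rate limiting or other heuristics.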