"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
However,
(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new, and it's for this reason that curl has had options to throttle repeated requests to servers forever. In real life, there are many instances of things being offered for free, and it's usually not okay to take them all. Yes, that would be abuse. And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.
(2) there's a difference between (a) a regular user reading my website or even copying and redistributing my content as long as the license of this work / the fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter, theft is not the right word, let's call a spade a spade)
(3) well-behaved robots are expected to respect robots.txt. This is not the law, this is about being respectful. It is only fair that badly behaved robots get called out.
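For what it's worth, "respecting robots.txt" is not much work. Here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The bot name `examplebot`, the sample robots.txt and the one-second fallback delay are all made up for illustration; a real crawler would fetch the site's actual robots.txt and sleep for the returned delay between requests.

```python
import urllib.robotparser

# Hypothetical fallback pause (seconds) when robots.txt sets no Crawl-delay.
DEFAULT_DELAY = 1.0


def check_robots(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, delay_seconds) for `url` under the given robots.txt text."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = DEFAULT_DELAY
    return allowed, float(delay)


# Made-up robots.txt for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

if __name__ == "__main__":
    for path in ("/index.html", "/private/notes.html"):
        url = "https://example.com" + path
        print(url, check_robots(SAMPLE_ROBOTS_TXT, "examplebot", url))
```

None of this is enforcement, of course. It only shows that a crawler that ignores the file does so by choice, not because honoring it is hard.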
Well-behaved robots do not usually use millions of residential IPs through shady apps to "perform a GET request to an open HTTP server".
In Germany, it is the law. § 44b UrhG says (translated):
(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.
(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.
(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.
Some antivirus and parental control software will scan links sent to someone from their machine (or from access points/routers).
Even some antivirus services will fetch links from residential IPs in order to detect malware from sites configured to serve malware only to residential IPs.
Actually, I'm not entirely sure how one would tell the difference between user software scanning links to detect adult content/malware/etc., randos crawling the web searching for personal information/vulnerable sites/etc., and these supposed "AI crawlers" just from access logs.
While I'm certainly not going to dismiss the idea that these are poorly configured crawlers at some major AI company, I haven't seen much in the way of evidence that that is the case.
If your antivirus software hammers the same website several times a second for hours on end, in a way that is indistinguishable from an "AI crawler", then maybe it's really misbehaving and should be stopped from doing so.
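To make "several times a second for hours on end" concrete, here is a rough sketch of how one might flag that pattern from parsed access-log entries with a per-client sliding window. The function name, the input format of `(timestamp, client_id)` pairs and the thresholds are assumptions for illustration, not anything from a particular server or log tool.

```python
from collections import defaultdict, deque


def find_hammering_clients(requests, max_requests=10, window_seconds=5.0):
    """Return the set of clients that exceed max_requests within any window.

    `requests` is an iterable of (timestamp_seconds, client_id) pairs,
    e.g. extracted from an access log. Thresholds are illustrative only.
    """
    recent = defaultdict(deque)  # client_id -> timestamps inside the window
    flagged = set()
    for ts, client in sorted(requests):
        window = recent[client]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(client)
    return flagged


if __name__ == "__main__":
    # Synthetic data: "A" sends 20 requests at 10 per second, "B" sends 3 spaced 10 s apart.
    log = [(i * 0.1, "A") for i in range(20)] + [(i * 10.0, "B") for i in range(3)]
    print(find_hammering_clients(log))
```

This only says who is hammering, not why or on whose behalf, and keying on a single IP misses traffic that is spread thinly across many residential addresses.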
Personally, I'm skeptical of blaming everything on AI scrapers. Everything people are complaining about has been happening for decades - mostly by people searching for website vulnerabilities/sensitive info who don't care if they're misbehaving, sometimes by random individuals who want to archive a site or are playing with a crawler and don't see why they should slow it down.
Even the techniques for poisoning aggressive or impolite crawlers are at least 30 years old.
The only thing that seems to have changed is that today's thread is full of people who think they have some sort of human right to access any website by any means possible, including their sloppy vibe-coded crawler. In the past, IIRC, people used to be a little more apologetic about consuming other people's resources and did their best to fly below the radar.
It's my website. I have every right to block anyone at any time for any reason whatsoever. Whether or not your use case is "legitimate" is beside the point.
And just to not leave it merely implied, I don't give a rat's ass if that slows down your "innovation." Go away.