The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.
if you put your server up on the public internet then this is just table stakes stuff that you always need to deal with; it doesn't really matter whether the traffic is from botnets or crawlers or AI systems or anything else
you're always gonna deal with this stuff well before the requests ever get to your application, with WAFs or reverse proxies or (idk) fail2ban or whatever else
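for anyone who hasn't set this up before, here's a rough sketch of the kind of per-client throttling that layer does. it's a toy token bucket in Python, not how any particular WAF or proxy implements it; the `TokenBucket` class, its parameters, and the client key are all made up for illustration, and in practice you'd use the rate limiting built into your proxy instead of writing your own

```python
import time
from collections import defaultdict


class TokenBucket:
    """Toy per-client token bucket (illustrative only, not a real proxy's code)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second, per client
        self.burst = burst    # maximum tokens a client can accumulate
        self.clock = clock    # injectable clock, handy for testing
        self.tokens = defaultdict(lambda: float(burst))  # new clients start full
        self.last = {}        # last time each client was seen

    def allow(self, client):
        """Return True if this client's request may pass, False to reject it."""
        now = self.clock()
        elapsed = now - self.last.get(client, now)
        self.last[client] = now
        # Refill for the time that has passed, capped at the burst size.
        self.tokens[client] = min(self.burst, self.tokens[client] + elapsed * self.rate)
        if self.tokens[client] >= 1:
            self.tokens[client] -= 1
            return True
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate=1, burst=5)  # assumed limits: 1 req/s, bursts of 5
    # Ten back-to-back requests from one (hypothetical) client address.
    print([limiter.allow("203.0.113.7") for _ in range(10)])
```

the point is just that the rejection happens on a cheap counter check keyed by client, so the expensive application code never runs for the excess requests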
also 1000 req/hour is 1 request every 3.6 seconds (about 0.28 rps), which is effectively 0 rps for any endpoint that would ever be publicly accessible
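if you want to check that arithmetic yourself, it's a one-liner; the helper name here is just something made up for the example

```python
def per_hour_to_rate(requests_per_hour: float) -> tuple[float, float]:
    """Return (requests per second, seconds between requests)."""
    rps = requests_per_hour / 3600  # 3600 seconds in an hour
    return rps, 1 / rps


rps, gap = per_hour_to_rate(1000)
print(f"{rps:.3f} rps, one request every {gap:.1f} s")
# prints "0.278 rps, one request every 3.6 s"
```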
Background scanner noise on the internet is incredibly common, but AI scraping is not on the same level. Wikipedia has published that their infrastructure costs have notably shot up since LLMs started scraping them. I've seen similar idiotic behavior on a small wiki I run; a single AI company took the data usage from "who gives a crap" to "this is approaching the point where I'm not willing to pay to keep this site up." Businesses can "just" pass the costs on to the customers (which is pretty shit at the end of the day), but a lot of privately run and open source sites are now having to deal with side crap that isn't relevant to their focus.
The botnets and DDoS groups that do mass scanning and testing are targeted by law enforcement and eventually (hopefully) taken down, because what they're doing is acknowledged as bad.
AI companies, however, are trying to make a profit off of this bad behavior and we're expected to be okay with it? At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
Also, they might share the common viewpoint of "it's the internet; suck it up."