AI web crawlers are destroying websites in their never-ending content hunger

(www.theregister.com)


krunck ◴[02 Sep 25 17:17 UTC] No.45106065[source]▶

I just block them by User Agent string[1]. The rest that fake the UA get clobbered by rate limiting[2] on the web server. Not perfect, but our site is not getting hammered any more.

[1] https://perishablepress.com/ultimate-ai-block-list/

[2] https://github.com/jzdziarski/mod_evasive
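The two-layer approach described above (reject known crawler user agents, then rate limit whatever is left) can be sketched roughly as follows. This is a minimal illustration, not how mod_evasive or the linked block list actually work: the three user-agent substrings are a tiny illustrative subset, and `SlidingWindowLimiter`, `is_blocked_user_agent`, and `decide` are hypothetical names invented for this example.

```python
import time
from collections import defaultdict, deque

# Illustrative subset only; a maintained block list is far longer.
BLOCKED_UA_SUBSTRINGS = ("gptbot", "ccbot", "bytespider")


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within the last window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def is_blocked_user_agent(user_agent):
    """Case-insensitive substring match against the block list."""
    ua = user_agent.lower()
    return any(s in ua for s in BLOCKED_UA_SUBSTRINGS)


def decide(ip, user_agent, limiter, now=None):
    """Return the HTTP status a front end might send for this request."""
    if is_blocked_user_agent(user_agent):
        return 403  # honest crawlers are rejected outright
    if not limiter.allow(ip, now):
        return 429  # clients with a faked UA are caught by the rate limit
    return 200
```

As the reply below points out, a per-IP limiter like this only helps when each address sends many requests; it does nothing against a crawler spread thinly over a huge IP pool.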

replies(1): >>45106323 #

1. braden_e ◴[02 Sep 25 17:32 UTC] No.45106323[source]▶

>>45106065 #

There is a very large-scale crawler that uses random valid user agents and a staggeringly large pool of IPs. I first noticed it because a lot of traffic was coming from Brazil and "HostRoyale" (ASN 203020). They send only a few requests a day from each IP, so rate limiting is not useful.

I run a honeypot that generates URLs containing the source IP, so I am pretty confident it is all one bot. In the past 48 hours, over 200,000 IPs have hit the honeypot.

I am pretty sure this is Bytedance: they occasionally hit these tagged honeypot URLs with their normal user agent from their usual .sg datacenter.
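One way such IP-tagged honeypot URLs could be built is sketched below. This is a guess at the general technique, not the commenter's actual setup: the `/archive/` path, the `SECRET_KEY` value, and both function names are assumptions made for illustration. The idea is that each generated link encodes the IP it was served to, so when a different IP later requests that link, the two addresses can be tied to the same crawler.

```python
import hashlib
import hmac
import ipaddress

# Assumption: a server-side secret so visitors cannot forge tagged URLs.
SECRET_KEY = b"replace-with-a-random-secret"
PREFIX = "/archive/"  # hypothetical honeypot path


def _sign(token):
    """Short HMAC over the token, truncated to keep URLs compact."""
    return hmac.new(SECRET_KEY, token.encode(), hashlib.sha256).hexdigest()[:16]


def make_honeypot_path(source_ip):
    """Build a link that encodes the IP the page was served to."""
    token = ipaddress.ip_address(source_ip).packed.hex()
    return f"{PREFIX}{token}-{_sign(token)}"


def decode_honeypot_path(path):
    """Return the IP the link was originally served to, or None if invalid."""
    if not path.startswith(PREFIX):
        return None
    token, sep, sig = path[len(PREFIX):].partition("-")
    if not sep:
        return None
    if not hmac.compare_digest(sig.encode(), _sign(token).encode()):
        return None
    try:
        return str(ipaddress.ip_address(bytes.fromhex(token)))
    except ValueError:
        return None
```

On each honeypot hit, comparing `decode_honeypot_path(path)` with the requesting address shows which IP first harvested the link and which IP came back to fetch it.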

replies(3): >>45106389 #>>45107318 #>>45107468 #

2. kjkjadksj ◴[02 Sep 25 17:37 UTC] No.45106389[source]▶

>>45106323 (TP) #

I wonder if you could implement a dummy rate limit? Requests get randomly rejected as "rate limited" about half the time. A real user will think nothing of it and refresh the page.
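A rough sketch of what such a dummy rate limit might look like, assuming the server answers a random fraction of requests with HTTP 429 regardless of actual load. The function name, the `reject_probability` parameter, and the `Retry-After` value are all hypothetical choices for this example.

```python
import random


def dummy_rate_limit(reject_probability=0.5, rng=random):
    """Randomly pretend the client is rate limited.

    Returns (status, headers). No request counting is done at all:
    each request is rejected independently with the given probability.
    """
    if rng.random() < reject_probability:
        # Hypothetical choice: invite an immediate retry.
        return 429, {"Retry-After": "1"}
    return 200, {}
```

With the default of 0.5, a client that retries until it succeeds needs two attempts per page on average, which is the cost this idea imposes on people and crawlers alike.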

replies(1): >>45107142 #

3. ronsor ◴[02 Sep 25 18:30 UTC] No.45107142[source]▶

>>45106389 #

That will irritate real users half the time while the bots won't care.

replies(1): >>45118561 #

4. candlemas ◴[02 Sep 25 18:44 UTC] No.45107318[source]▶

>>45106323 (TP) #

My site has also recently been getting massively hit by Brazilian IPs. It lasts for a day or two, even if they are being blocked.

5. dizlexic ◴[02 Sep 25 18:55 UTC] No.45107468[source]▶

>>45106323 (TP) #

I've written my own bots that do exactly this. My reason was mainly to avoid detection, so as part of that I also severely throttled my requests and hit the target at random intervals. In other words, I wasn't trying to abuse them. I just didn't want them to notice me.

TL;DR: it's trivial to send fake info when you're the one who controls the info.

6. kjkjadksj ◴[03 Sep 25 17:45 UTC] No.45118561{3}[source]▶

>>45107142 #

If they are a real user visiting your site in 2025, then they have no alternative they are even interested in. They will blame their ISP and wait.

Meanwhile, rate limiting the LLM crawler could potentially cost a lot of money in time and compute for people who don’t have our best interests at heart. Seems like a win to me.
