krunck ◴[] No.45106065[source]
I just block them by User-Agent string[1]. The rest, which fake the UA, get clobbered by rate limiting[2] on the web server. Not perfect, but our site is not getting hammered any more.

[1] https://perishablepress.com/ultimate-ai-block-list/

[2] https://github.com/jzdziarski/mod_evasive
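
For anyone who wants the same two layers outside Apache, here is a minimal sketch as Python WSGI middleware. The blocklist entries, window, and threshold are placeholders, not the values from [1] or [2]:

    import time
    from collections import defaultdict

    # Placeholder substrings; the real list at perishablepress.com is far longer.
    BLOCKED_UA_SUBSTRINGS = ("gptbot", "ccbot", "bytespider", "claudebot")

    RATE_LIMIT = 30      # max requests per window per IP (arbitrary)
    WINDOW_SECONDS = 10

    class BotFilterMiddleware:
        """Layer 1: drop known AI crawlers by User-Agent.
        Layer 2: rate-limit everything else per client IP.
        (In-memory state only; a sketch, not production code.)"""

        def __init__(self, app):
            self.app = app
            self.hits = defaultdict(list)  # ip -> recent request timestamps

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]

            ip = environ.get("REMOTE_ADDR", "unknown")
            now = time.time()
            recent = [t for t in self.hits[ip] if now - t < WINDOW_SECONDS]
            recent.append(now)
            self.hits[ip] = recent
            if len(recent) > RATE_LIMIT:
                start_response("429 Too Many Requests",
                               [("Retry-After", str(WINDOW_SECONDS))])
                return [b"Slow down"]

            return self.app(environ, start_response)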

replies(1): >>45106323 #
braden_e ◴[] No.45106323[source]
There is a very large-scale crawler that uses random but valid user agents and a staggeringly large pool of IPs. I first noticed it because a lot of traffic was coming from Brazil and "HostRoyale" (ASN 203020). They send only a few requests a day from each IP, so per-IP rate limiting is not useful.
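
Since per-IP limits fail here, one thing that does surface this pattern is tallying requests by network instead of by address. A rough sketch using MaxMind's free GeoLite2-ASN database via the geoip2 library (the log path and first-field-is-IP format are assumptions):

    import geoip2.database
    import geoip2.errors
    from collections import Counter

    # Requires the free GeoLite2-ASN.mmdb download from MaxMind.
    reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")

    def asn_of(ip):
        try:
            r = reader.asn(ip)
            return r.autonomous_system_number, r.autonomous_system_organization
        except (geoip2.errors.AddressNotFoundError, ValueError):
            return None, None

    # Tally requests per ASN from an access log whose first field is the IP.
    counts = Counter()
    with open("access.log") as f:
        for line in f:
            asn, org = asn_of(line.split()[0])
            if asn:
                counts[(asn, org)] += 1

    # A single hosting ASN (e.g. 203020, "HostRoyale") sending thousands of
    # requests from thousands of distinct IPs stands out immediately here,
    # even though every individual IP stays under any per-IP limit.
    for (asn, org), n in counts.most_common(10):
        print(f"AS{asn} {org}: {n} requests")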

I run a honeypot that generates URLs tagged with the requesting IP, so I am pretty confident it is all one bot: in the past 48 hours I have had over 200,000 IPs hit the honeypot.
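
A minimal sketch of that tagging idea (the HMAC signing and URL layout here are assumptions, not necessarily the exact scheme in use):

    import base64, hashlib, hmac

    SECRET = b"change-me"  # placeholder key; keep it private

    def tag_for(ip: str) -> str:
        """Derive an opaque, unforgeable token from the requesting IP."""
        mac = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:16]
        return base64.urlsafe_b64encode(f"{ip}|{mac}".encode()).decode().rstrip("=")

    def honeypot_link(ip: str) -> str:
        # Served only where real users never click, e.g. a link hidden
        # with CSS or a path disallowed in robots.txt.
        return f"/trap/{tag_for(ip)}/"

    def decode_tag(tag: str) -> str | None:
        """On a hit to /trap/<tag>/, recover the IP the link was served to.
        When a *different* IP fetches it, the crawl is provably distributed."""
        try:
            raw = base64.urlsafe_b64decode(tag + "=" * (-len(tag) % 4)).decode()
        except (ValueError, UnicodeDecodeError):
            return None
        ip, _, mac = raw.rpartition("|")
        good = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:16]
        return ip if ip and hmac.compare_digest(mac, good) else None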

I am pretty sure this is Bytedance: they occasionally hit these tagged honeypot URLs with their normal user agent from their usual .sg datacenter.

replies(3): >>45106389 #>>45107318 #>>45107468 #
kjkjadksj ◴[] No.45106389[source]
I wonder if you could implement a dummy rate limit: half the time, requests are rate-limited at random. A real user will think nothing of it and refresh the page.
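
A minimal sketch of such a coin-flip limiter as Python WSGI middleware (the 50% probability and Retry-After value are arbitrary):

    import random

    class CoinFlipLimiter:
        """Randomly refuse a fraction of all requests with a 429, regardless
        of actual load, to waste crawler time at near-zero server cost."""

        def __init__(self, app, p=0.5):
            self.app = app
            self.p = p

        def __call__(self, environ, start_response):
            if random.random() < self.p:
                start_response("429 Too Many Requests",
                               [("Retry-After", "5")])
                return [b"Too many requests, try again shortly."]
            return self.app(environ, start_response)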
replies(1): >>45107142 #
ronsor ◴[] No.45107142[source]
That will irritate real users half the time, while the bots won't care.
replies(1): >>45118561 #
kjkjadksj ◴[] No.45118561[source]
If they are a real user visiting your site in 2025, then they have no alternative they are interested in anyway. They will blame their ISP and wait.

Meanwhile, rate-limiting the LLM crawlers could cost the people who don’t have our best interests at heart a lot of money in time and compute. Seems like a win to me.