krunck ◴[] No.45106065[source]
I just block them by User-Agent string[1]. The rest, which fake the UA, get clobbered by rate limiting[2] on the web server. Not perfect, but our site is not getting hammered any more.

[1] https://perishablepress.com/ultimate-ai-block-list/

[2] https://github.com/jzdziarski/mod_evasive
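
For anyone who wants the same two layers outside Apache, here is a minimal sketch as Python WSGI middleware. The blocklist entries, window, and threshold are placeholders, not the values from [1] or [2]:

    import time
    from collections import defaultdict

    # Placeholder substrings; the real list at perishablepress.com is far longer.
    BLOCKED_UA_SUBSTRINGS = ("gptbot", "ccbot", "bytespider", "claudebot")

    RATE_LIMIT = 30      # max requests per window per IP (arbitrary)
    WINDOW_SECONDS = 10

    class BotFilterMiddleware:
        """Layer 1: drop known AI crawlers by User-Agent.
        Layer 2: rate-limit everything else per client IP.
        (In-memory state only; a sketch, not production code.)"""

        def __init__(self, app):
            self.app = app
            self.hits = defaultdict(list)  # ip -> recent request timestamps

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]

            ip = environ.get("REMOTE_ADDR", "unknown")
            now = time.time()
            recent = [t for t in self.hits[ip] if now - t < WINDOW_SECONDS]
            recent.append(now)
            self.hits[ip] = recent
            if len(recent) > RATE_LIMIT:
                start_response("429 Too Many Requests",
                               [("Retry-After", str(WINDOW_SECONDS))])
                return [b"Slow down"]

            return self.app(environ, start_response)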

replies(1): >>45106323 #
braden_e ◴[] No.45106323[source]
There is a very large-scale crawler that uses random but valid user agents and a staggeringly large pool of IPs. I first noticed it because a lot of traffic was coming from Brazil and "HostRoyale" (ASN 203020). They send only a few requests a day from each IP, so per-IP rate limiting is not useful.
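
Since per-IP limits fail here, one thing that does surface this pattern is tallying requests by network instead of by address. A rough sketch using MaxMind's free GeoLite2-ASN database via the geoip2 library (the log path and first-field-is-IP format are assumptions):

    import geoip2.database
    import geoip2.errors
    from collections import Counter

    # Requires the free GeoLite2-ASN.mmdb download from MaxMind.
    reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")

    def asn_of(ip):
        try:
            r = reader.asn(ip)
            return r.autonomous_system_number, r.autonomous_system_organization
        except (geoip2.errors.AddressNotFoundError, ValueError):
            return None, None

    # Tally requests per ASN from an access log whose first field is the IP.
    counts = Counter()
    with open("access.log") as f:
        for line in f:
            asn, org = asn_of(line.split()[0])
            if asn:
                counts[(asn, org)] += 1

    # A single hosting ASN (e.g. 203020, "HostRoyale") sending thousands of
    # requests from thousands of distinct IPs stands out immediately here,
    # even though every individual IP stays under any per-IP limit.
    for (asn, org), n in counts.most_common(10):
        print(f"AS{asn} {org}: {n} requests")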

I run a honeypot that generates URLs tagged with the requesting IP, so I am pretty confident it is all one bot: in the past 48 hours I have had over 200,000 IPs hit the honeypot.
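
A minimal sketch of that tagging idea (the HMAC signing and URL layout here are assumptions, not necessarily the exact scheme in use):

    import base64, hashlib, hmac

    SECRET = b"change-me"  # placeholder key; keep it private

    def tag_for(ip: str) -> str:
        """Derive an opaque, unforgeable token from the requesting IP."""
        mac = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:16]
        return base64.urlsafe_b64encode(f"{ip}|{mac}".encode()).decode().rstrip("=")

    def honeypot_link(ip: str) -> str:
        # Served only where real users never click, e.g. a link hidden
        # with CSS or a path disallowed in robots.txt.
        return f"/trap/{tag_for(ip)}/"

    def decode_tag(tag: str) -> str | None:
        """On a hit to /trap/<tag>/, recover the IP the link was served to.
        When a *different* IP fetches it, the crawl is provably distributed."""
        try:
            raw = base64.urlsafe_b64decode(tag + "=" * (-len(tag) % 4)).decode()
        except (ValueError, UnicodeDecodeError):
            return None
        ip, _, mac = raw.rpartition("|")
        good = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:16]
        return ip if ip and hmac.compare_digest(mac, good) else None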

I am pretty sure this is Bytedance: they occasionally hit these tagged honeypot URLs with their normal user agent from their usual .sg datacenter.

replies(3): >>45106389 #>>45107318 #>>45107468 #
kjkjadksj ◴[] No.45106389[source]
I wonder if you could implement a dummy rate limit: half the time, requests are rate-limited at random. A real user will think nothing of it and refresh the page.
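
A minimal sketch of such a coin-flip limiter as Python WSGI middleware (the 50% probability and Retry-After value are arbitrary):

    import random

    class CoinFlipLimiter:
        """Randomly refuse a fraction of all requests with a 429, regardless
        of actual load, to waste crawler time at near-zero server cost."""

        def __init__(self, app, p=0.5):
            self.app = app
            self.p = p

        def __call__(self, environ, start_response):
            if random.random() < self.p:
                start_response("429 Too Many Requests",
                               [("Retry-After", "5")])
                return [b"Too many requests, try again shortly."]
            return self.app(environ, start_response)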
replies(1): >>45107142 #
ronsor ◴[] No.45107142[source]
That will irritate real users half the time, while the bots won't care.
replies(1): >>45118561 #
kjkjadksj ◴[] No.45118561[source]
If they are a real user visiting your site in 2025, then they have no alternative they are interested in anyway. They will blame their ISP and wait.

Meanwhile, rate-limiting the LLM crawlers could cost the people who don’t have our best interests at heart a lot of money in time and compute. Seems like a win to me.