I just block them by User Agent string[1]. The rest that fake the UA get clobbered by rate limiting[2] on the web server. Not perfect, but our site is not getting hammered any more.
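For anyone wanting to try the same two-layer approach, here is a rough sketch of the idea, not the actual server config referred to above. It rejects requests whose User-Agent matches a blocklist, then applies a per-IP token bucket to everything else. The names `BLOCKED_UA_SUBSTRINGS`, `TokenBucket`, and `RequestFilter`, the example UA substrings, and the rate and burst numbers are all assumptions for illustration.

```python
import time
from dataclasses import dataclass

# Illustrative blocklist only (assumption); use whatever list you trust.
BLOCKED_UA_SUBSTRINGS = ("bytespider", "gptbot")


@dataclass
class TokenBucket:
    rate: float      # tokens added per second
    capacity: float  # maximum burst size
    tokens: float    # tokens currently available
    updated: float   # timestamp of the last refill

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RequestFilter:
    """Hypothetical filter: UA blocklist first, then per-IP rate limiting."""

    def __init__(self, rate: float = 1.0, burst: float = 5.0):
        self.rate = rate
        self.burst = burst
        self.buckets = {}  # ip -> TokenBucket

    def check(self, ip: str, user_agent: str, now=None) -> int:
        """Return the HTTP status this request should receive."""
        now = time.monotonic() if now is None else now
        ua = user_agent.lower()
        if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
            return 403  # honest bots that announce themselves
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.burst, now)
            self.buckets[ip] = bucket
        # Bots faking the UA still hit the per-IP limit.
        return 200 if bucket.allow(now) else 429
```

In practice most people would do this in the web server itself rather than in application code; the sketch only shows the logic. Note that per-IP limiting does little against a crawler spread across a very large number of addresses.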
replies(1):
I run a honeypot that generates URLs tagged with the source IP, so I am pretty confident it is all one bot. In the past 48 hours, over 200,000 IPs have hit the honeypot.
I am pretty sure this is ByteDance: they occasionally hit these tagged honeypot URLs with their normal user agent, from their usual .sg datacenter.
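One way such IP-tagged URLs could work, as a guess at the general technique rather than a description of this particular honeypot: each generated link embeds the IP it was served to, plus a short HMAC so the tag cannot be forged. When a different IP later requests that link, the two addresses are probably the same crawler. `SECRET`, `PREFIX`, `make_trap_url`, `origin_of`, and `is_cross_ip_hit` are all made-up names for this sketch.

```python
import base64
import hashlib
import hmac

SECRET = b"change-me"  # assumption: a private key known only to the server
PREFIX = "/trap/"      # assumption: path prefix reserved for honeypot links


def _sign(payload: bytes) -> str:
    # Truncated HMAC keeps URLs short while still resisting forgery.
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()[:16]


def make_trap_url(source_ip: str) -> str:
    """Build a link that records which IP it was originally served to."""
    payload = source_ip.encode("ascii")
    tag = base64.urlsafe_b64encode(payload).decode("ascii").rstrip("=")
    return f"{PREFIX}{tag}.{_sign(payload)}"


def origin_of(path: str):
    """Return the IP the link was served to, or None if it is not ours."""
    if not path.startswith(PREFIX):
        return None
    tag, sep, sig = path[len(PREFIX):].partition(".")
    if not sep:
        return None
    try:
        # Restore the base64 padding that make_trap_url stripped.
        payload = base64.urlsafe_b64decode(tag + "=" * (-len(tag) % 4))
        expected = _sign(payload).encode("ascii")
        if not hmac.compare_digest(sig.encode("utf-8"), expected):
            return None
        return payload.decode("ascii")
    except ValueError:
        # Covers bad base64 and non-ASCII input.
        return None


def is_cross_ip_hit(path: str, requester_ip: str) -> bool:
    """True when a valid trap link is fetched from a different IP."""
    origin = origin_of(path)
    return origin is not None and origin != requester_ip
```

Logging every `is_cross_ip_hit` pair gives a graph of addresses that shared links, which is the kind of evidence that would support the "all one bot" conclusion.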
Meanwhile, rate limiting the LLM crawlers could potentially cost a lot of money, in time and compute, for people who don’t have our best interests at heart. Seems like a win to me.