←back to thread

770 points ta988 | 1 comments | | HN request time: 0.001s | source
Show context
markerz ◴[] No.42551173[source]
One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems a bit naive for some reason and doesn't do performance back-off the way I would expect from Google Bot. It just kept repeatedly requesting more and more until my server crashed, then it would back off for a minute and then request more again.

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added more nofollow rules to links and a robots.txt but those are just suggestions and some bots seem to ignore them.

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.

replies(14): >>42551260 #>>42551410 #>>42551412 #>>42551513 #>>42551649 #>>42551742 #>>42552017 #>>42552046 #>>42552437 #>>42552763 #>>42555123 #>>42562686 #>>42565119 #>>42572754 #
bodantogat ◴[] No.42551649[source]
I see a lot of traffic I can tell are bots based on the URL patterns they access. They do not include the "bot" user agent, and often use residential IP pools. I haven't found an easy way to block them. They nearly took out my site a few days ago too.
replies(5): >>42551680 #>>42551803 #>>42556117 #>>42558781 #>>42574346 #
kmoser ◴[] No.42556117[source]
My cheap and dirty way of dealing with bots like that is to block any IP address that accesses any URLs in robots.txt. It's not a perfect strategy but it gives me pretty good results given the simplicity to implement.
replies(2): >>42561452 #>>42563185 #
Beijinger ◴[] No.42563185[source]
How can I implement this?
replies(2): >>42564237 #>>42569373 #
1. aorth ◴[] No.42564237{3}[source]
Another related idea: use fail2ban to monitor the server access logs. There is one filter that will ban hosts that request non-existent URLs like WordPress login and other PHP files. If your server is not hosting PHP at all it's an obvious sign that the requests are from bots that are probing maliciously.