markerz:
One of my websites was absolutely destroyed by Meta's AI bot, Meta-ExternalAgent: https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems surprisingly naive and doesn't back off under load the way I would expect Googlebot to. It just kept requesting more and more pages until my server crashed; then it would back off for a minute and start requesting again.

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added rel="nofollow" to more links, plus a robots.txt, but those are just suggestions and some bots seem to ignore them.
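For anyone not fronted by Cloudflare, the same User-Agent block can be done in the app itself. A rough sketch as Python WSGI middleware (the substring list is illustrative, not a canonical blocklist):

    # Minimal WSGI middleware: return 403 for any request whose
    # User-Agent contains a blocked substring. List is illustrative.
    BLOCKED_UA_SUBSTRINGS = ("meta-externalagent", "gptbot", "ccbot")

    def block_ai_bots(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)
        return middleware

Blocking at the edge is still preferable, though, since those requests then never reach your server at all.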

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare's centralization, this was a super convenient feature.

bodantogat:
I see a lot of traffic that I can tell comes from bots, based on the URL patterns they request. They don't include "bot" in their user-agent string, and they often come from residential IP pools, so I haven't found an easy way to block them. They nearly took out my site a few days ago too.
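One partial workaround: since those URL patterns are the tell, you can mine them out of your access log and feed the offending IPs into a firewall deny list, fail2ban-style. A rough Python sketch, assuming a combined-format Apache/nginx log; the suspicious patterns and threshold are made up and would need tuning:

    # Tally per-IP hits on URL patterns that only scanners/bots request,
    # then print IPs over a threshold (e.g. to pipe into an ipset or
    # firewall deny list). Patterns and threshold are illustrative.
    import re
    from collections import Counter

    SUSPICIOUS = re.compile(r"/(wp-login\.php|xmlrpc\.php|\.env|\.git/)")
    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')
    THRESHOLD = 20

    hits = Counter()
    with open("access.log") as f:
        for line in f:
            m = LOG_LINE.match(line)
            if m and SUSPICIOUS.search(m.group(2)):
                hits[m.group(1)] += 1

    for ip, count in hits.items():
        if count >= THRESHOLD:
            print(ip)

It won't catch a pool that spreads requests thinly across many IPs, but it cheaply skims off the worst offenders.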
kmoser:
My cheap and dirty way of dealing with bots like that is to block any IP address that accesses any of the URLs disallowed in robots.txt. It's not a perfect strategy, but it gives me pretty good results given how simple it is to implement.
Beijinger:
How can I implement this?
kmoser:
Too many ways to list here, and the implementation details will depend on your hosting environment and other requirements. But my quick-and-dirty trick involves a single URL which, when visited, runs a script that appends "deny from foo" (where foo is the naughty IP address) to my .htaccess file. The URL in question is not publicly listed, so nobody will stumble upon it and accidentally ban themselves. It's also specifically disallowed in robots.txt, so in theory it will only ever be visited by bad bots.
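The shape of it is roughly this (a guess at the mechanism, not kmoser's actual code): an unlisted path, disallowed in robots.txt, served by a tiny script that bans whoever requests it. A sketch as a Python CGI script; the .htaccess path is an assumption, and a real version would want locking, IP validation, and dedup:

    #!/usr/bin/env python3
    # Trap endpoint: append the visitor's IP to .htaccess as an Apache
    # "deny from" rule. Path is an assumption; no locking or dedup here.
    import os

    ip = os.environ.get("REMOTE_ADDR", "")
    if ip:
        with open("/var/www/html/.htaccess", "a") as f:
            f.write(f"deny from {ip}\n")

    # Minimal CGI response
    print("Content-Type: text/plain")
    print()
    print("Goodbye.")

robots.txt then carries a Disallow for that path, so only crawlers that ignore robots.txt ever hit it.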