
770 points by ta988 | 2 comments
markerz No.42551173
One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems a bit naive and doesn't back off under load the way I would expect from Googlebot. It just kept requesting more and more until my server crashed, then it would back off for a minute and start requesting again.

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added more nofollow rules to links and a robots.txt, but those are just suggestions and some bots seem to ignore them.

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.
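For anyone fronting their own server instead of Cloudflare, a rough nginx sketch of the same User-Agent block might look like this (the meta-externalagent substring comes from Meta's docs above; the backend address is a placeholder):

    # Flag requests whose User-Agent matches Meta's crawler.
    map $http_user_agent $is_meta_bot {
        default                 0;
        "~*meta-externalagent"  1;
    }

    server {
        listen 80;

        # Hard-block the crawler before it reaches the backend.
        if ($is_meta_bot) {
            return 403;
        }

        location / {
            proxy_pass http://127.0.0.1:8080;  # placeholder backend
        }
    }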

replies(14): >>42551260 #>>42551410 #>>42551412 #>>42551513 #>>42551649 #>>42551742 #>>42552017 #>>42552046 #>>42552437 #>>42552763 #>>42555123 #>>42562686 #>>42565119 #>>42572754 #
devit [dead post] No.42551513
[flagged]
aftbit No.42551670
Yeah, this is the sort of thing that a caching and rate-limiting load balancer (e.g. nginx) could very trivially mitigate. Just add a request-limit bucket keyed on the Meta User-Agent, allowing at most 1 qps or whatever (tune it to ~20% of your backend capacity), and return 429 when it's exceeded.
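A minimal nginx sketch of that idea, assuming the meta-externalagent User-Agent substring and a placeholder backend (all matching requests share one 1 r/s bucket; everyone else is untouched):

    # Requests with an empty key are not rate-limited, so normal visitors pass through.
    map $http_user_agent $meta_bot_bucket {
        default                 "";
        "~*meta-externalagent"  "meta-bot";  # one shared bucket for all Meta crawler IPs
    }

    limit_req_zone $meta_bot_bucket zone=metabot:1m rate=1r/s;

    server {
        listen 80;

        location / {
            limit_req zone=metabot burst=5 nodelay;  # allow a small burst, then shed load
            limit_req_status 429;                    # Too Many Requests instead of the default 503
            proxy_pass http://127.0.0.1:8080;        # placeholder backend
        }
    }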

Of course Cloudflare can do all of this for you, and they functionally have unlimited capacity.

replies(1): >>42551973 #
1. layer8 No.42551973
Read the article: the bots switch their User-Agent to an innocuous one when they start being blocked.

And for the internet as a whole, having to use Cloudflare is just as bad as bots routinely eating up all available resources.

replies(1): >>42568145 #
2. aftbit No.42568145
I did read the article. I'm skeptical of the claim, though: the author was careful to publish specific UAs for the bots, but provided no further information about the supposed non-bot UAs.

>If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

I'm also skeptical that _anyone_ needs to access the edit history at 10 qps. You could put an nginx rule on those routes that limits the edit-history pages to 0.5 qps per IP and 2 qps across all IPs, which would protect your site from both bad AI bots and dumb MediaWiki script kiddies, with little impact on legitimate visitors.
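A rough sketch of that nginx rule, assuming MediaWiki-style URLs where history and diff views are selected by query parameters (?action=history, ?diff=...) and a placeholder backend; adjust the patterns to the actual routing:

    # Key the limits only for history/diff requests; everything else gets an
    # empty key and is not rate-limited.
    map $args $history_ip_key {
        default                 "";
        "~(^|&)action=history"  $binary_remote_addr;
        "~(^|&)diff="           $binary_remote_addr;
    }
    map $args $history_global_key {
        default                 "";
        "~(^|&)action=history"  "all-history";
        "~(^|&)diff="           "all-history";
    }

    # 0.5 req/s per client IP (30r/m) and 2 req/s across all clients combined.
    limit_req_zone $history_ip_key     zone=history_per_ip:10m rate=30r/m;
    limit_req_zone $history_global_key zone=history_global:1m  rate=2r/s;

    server {
        listen 80;

        location / {
            limit_req zone=history_per_ip burst=2 nodelay;
            limit_req zone=history_global burst=4;
            limit_req_status 429;
            proxy_pass http://127.0.0.1:8080;  # MediaWiki backend (placeholder)
        }
    }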

>Oh, and of course, they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not.

And caching would fix this too, especially for pages that are guaranteed not to change (e.g. an edit history diff page).
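Something like this nginx caching sketch would cover it, assuming anonymous traffic only (any request carrying a cookie bypasses the cache) and a placeholder backend; truly immutable pages like old diffs could get a much longer TTL:

    # Serve repeat hits from a disk cache instead of re-rendering on the backend.
    proxy_cache_path /var/cache/nginx/wiki levels=1:2 keys_zone=wiki_cache:50m
                     max_size=5g inactive=7d;

    server {
        listen 80;

        location / {
            proxy_cache           wiki_cache;
            proxy_cache_key       "$scheme$host$request_uri";
            proxy_cache_valid     200 6h;                     # raise this for diff/history pages
            proxy_cache_use_stale error timeout updating;
            proxy_ignore_headers  Cache-Control Expires;      # MediaWiki often sends no-cache
            proxy_cache_bypass    $http_cookie;               # don't serve cached pages to sessions
            proxy_no_cache        $http_cookie;               # ...and don't cache their responses
            add_header            X-Cache-Status $upstream_cache_status;
            proxy_pass            http://127.0.0.1:8080;      # placeholder backend
        }
    }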

Don't get me wrong, I'm not unsympathetic to the author's plight, but I do think that the internet is an unsafe place full of bad actors, and a single bad actor can easily cause a lot of harm. I don't think throwing up your hands and complaining is that helpful. Instead, just apply the mitigations that have existed for this for at least 15 years and move on with your life. Your visitors will be happier and the bots will get boned.