770 points by ta988

markerz
One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems a bit naive and doesn't back off under load the way I would expect from Googlebot. It just kept requesting more and more until my server crashed, then it would back off for a minute and start requesting again.
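For contrast, the kind of back-off you'd expect from a polite crawler looks roughly like this. A minimal sketch, assuming a plain urllib fetch loop (not Meta's or Google's actual logic):

    import random
    import time
    import urllib.error
    import urllib.request

    def polite_fetch(url, max_attempts=5):
        """Fetch a URL, backing off exponentially on 429/5xx instead of hammering the origin."""
        delay = 1.0
        for _ in range(max_attempts):
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    return resp.read()
            except urllib.error.HTTPError as e:
                if e.code not in (429, 500, 502, 503, 504):
                    raise  # other errors won't improve on retry
                # Honor Retry-After if present, otherwise exponential backoff with jitter.
                retry_after = e.headers.get("Retry-After")
                time.sleep(float(retry_after) if retry_after else delay + random.uniform(0, delay))
                delay *= 2
        return None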

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added nofollow attributes to more links and a robots.txt, but those are just suggestions and some bots seem to ignore them.
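For reference, the robots.txt side of that is only a couple of lines, keeping in mind it is purely advisory. A sketch, assuming the user-agent token matches the Meta-ExternalAgent name above (the disallowed path is made up):

    # robots.txt -- advisory only; compliant crawlers honor it, others ignore it
    User-agent: meta-externalagent
    Disallow: /

    # everyone else: allowed, but keep a hypothetical expensive path out of crawlers
    User-agent: *
    Disallow: /search

The Cloudflare side can be a custom rule that matches the User-Agent string (an expression along the lines of http.user_agent contains "meta-externalagent") with a Block action, so the requests are dropped at the edge before they ever reach the origin.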

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.

devit [flagged]

markerz
Can't every webserver crash due to being overloaded? There's an upper limit to the performance of everything. My website is a hobby and runs on a $4/mo budget VPS.

Perhaps I'm saying crash and you're interpreting that as a bug, but really it's just an OOM issue because of too many in-flight requests. IDK, I don't care enough to serve my website at Facebook's scale.
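One cheap mitigation at that budget is to cap in-flight requests and shed the overflow instead of letting them pile up in memory. A minimal sketch, assuming an aiohttp app purely for illustration (the cap is a made-up number; nginx's limit_conn/limit_req can do the same job in front of any backend):

    import asyncio
    from aiohttp import web  # assumed stack, for illustration only

    MAX_IN_FLIGHT = 20  # made-up cap; tune to what the VPS can hold in RAM
    _slots = asyncio.Semaphore(MAX_IN_FLIGHT)

    @web.middleware
    async def shed_load(request, handler):
        # When every slot is busy, answer 503 right away instead of queueing
        # requests in memory until the process gets OOM-killed.
        if _slots.locked():
            return web.Response(status=503, headers={"Retry-After": "30"})
        async with _slots:
            return await handler(request)

    async def index(request):
        return web.Response(text="hello")  # stand-in for an expensive page

    app = web.Application(middlewares=[shed_load])
    app.router.add_get("/", index)

    if __name__ == "__main__":
        web.run_app(app)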

ndriscoll
I wouldn't expect it to crash in any case, but I'd generally expect even an N100 mini PC to bottleneck on the network long before you manage to saturate CPU/RAM (maybe if you had 10Gbit you could do it). The linked post indicates they're getting ~2 requests/second from bots, which might as well be zero. Even low-powered modern hardware can serve thousands to tens of thousands of requests per second.
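Rough numbers behind the "network first" claim, with the page size being an assumed figure:

    # Back-of-envelope: a 1 Gbit uplink moves ~125 MB/s. At an assumed ~50 KB per
    # rendered page, the link is full at roughly 2,400 responses/s, which is at or
    # below the thousands-to-tens-of-thousands of requests/s modest hardware can
    # render, and about three orders of magnitude above ~2 bot requests/s.
    link_bytes_per_s = 1e9 / 8
    page_bytes = 50 * 1024
    print(int(link_bytes_per_s / page_bytes))  # ≈ 2441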
troupo
You completely ignore the fact that they are also requesting a lot of pages that can be expensive to retrieve/calculate.
ndriscoll
Beyond something like running an ML model, what web pages are expensive enough to generate these days that 1-10 requests/second matters at all?
x0x0
I've worked on multiple sites like this over my career.

Our pages were expensive to generate, so what scraping did was blow out all our caches by yanking cold pages and images into memory: page caches, fragment caches, image caches, but also the DB working set in RAM, making every single thing on the site slow.
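That failure mode is easy to reproduce with any LRU-style cache: one sequential walk over cold keys evicts the entire hot working set. A toy sketch (the sizes are made up):

    from collections import OrderedDict

    class LRUCache:
        """Tiny LRU: recently used keys live at the end, evictions come from the front."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = OrderedDict()

        def get(self, key, compute):
            if key in self.data:
                self.data.move_to_end(key)       # hit: mark as recently used
                return self.data[key]
            value = compute(key)                 # miss: "expensive" page render
            self.data[key] = value
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)    # evict the coldest entry
            return value

    cache = LRUCache(capacity=100)               # pretend RAM holds 100 pages
    hot = [f"/popular/{i}" for i in range(100)]  # the site's real working set
    for page in hot:
        cache.get(page, lambda p: f"render {p}")  # warm the cache

    # A scraper walks 10,000 cold URLs once...
    for i in range(10_000):
        cache.get(f"/archive/{i}", lambda p: f"render {p}")

    # ...and now every hot page is a miss again, i.e. expensive to serve.
    hits = sum(1 for page in hot if page in cache.data)
    print(f"hot pages still cached: {hits}/100")  # prints 0/100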
