
770 points by ta988 | 1 comment
markerz No.42551173
One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems a bit naive for some reason and doesn't back off the way I would expect Googlebot to. It just kept requesting more and more until my server crashed, then it would back off for a minute and request more again.

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added more nofollow rules to links and a robots.txt but those are just suggestions and some bots seem to ignore them.
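
Since robots.txt and nofollow are only advisory, a hard block on the User-Agent is what actually keeps requests away from the expensive handlers. Here's a minimal application-level sketch of that idea, assuming a Flask app (the real block here was a Cloudflare rule, so this is just an equivalent fallback):

    # Sketch only: reject known AI-bot User-Agents before doing any real work.
    from flask import Flask, abort, request

    app = Flask(__name__)

    BLOCKED_UA_SUBSTRINGS = ("meta-externalagent",)  # extend with other bot names

    @app.before_request
    def block_ai_bots():
        ua = request.headers.get("User-Agent", "").lower()
        if any(bot in ua for bot in BLOCKED_UA_SUBSTRINGS):
            abort(403)  # hard refusal, unlike the advisory robots.txt/nofollow

    @app.route("/")
    def index():
        return "hello"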

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.

replies(14): >>42551260 #>>42551410 #>>42551412 #>>42551513 #>>42551649 #>>42551742 #>>42552017 #>>42552046 #>>42552437 #>>42552763 #>>42555123 #>>42562686 #>>42565119 #>>42572754 #
devit [dead post] No.42551513
[flagged]
markerz No.42551701
Can't every webserver crash due to being overloaded? There's an upper limit to the performance of everything. My website is a hobby and runs on a $4/mo budget VPS.

Perhaps I'm saying crash and you're interpreting that as a bug, but really it's just an OOM issue because of too many in-flight requests. IDK, I don't care enough to handle serving my website at Facebook's scale.
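
For what it's worth, the usual way to keep that failure mode from turning into an OOM is to cap in-flight requests and shed the rest with a 503. A rough sketch below, assuming an aiohttp app (not necessarily the actual stack here) and a made-up cap:

    from aiohttp import web

    MAX_IN_FLIGHT = 50  # made-up cap; tune to the box's memory
    _in_flight = 0

    @web.middleware
    async def shed_load(request, handler):
        # Reject new work with a 503 instead of queueing unbounded requests in memory.
        global _in_flight
        if _in_flight >= MAX_IN_FLIGHT:
            return web.Response(status=503, text="busy, try again later")
        _in_flight += 1
        try:
            return await handler(request)
        finally:
            _in_flight -= 1

    async def index(request):
        return web.Response(text="hello")

    app = web.Application(middlewares=[shed_load])
    app.router.add_get("/", index)

    if __name__ == "__main__":
        web.run_app(app)  # listens on :8080 by default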

replies(2): >>42551869 #>>42551889 #
ndriscoll No.42551889
I wouldn't expect it to crash in any case, but I'd generally expect that even an N100 mini PC would bottleneck on the network long before you manage to saturate CPU/RAM (maybe if you had 10Gbit you could do it). The linked post indicates they're getting ~2 requests/second from bots, which might as well be zero. Even low-powered modern hardware can handle thousands to tens of thousands of requests per second.
replies(1): >>42552279 #
troupo No.42552279
You're completely ignoring the fact that they're also requesting a lot of pages that can be expensive to retrieve/calculate.
replies(1): >>42552510 #
ndriscoll No.42552510
Beyond something like running an ML model, what web pages are expensive enough to generate these days that 1-10 requests/second matters at all?
replies(3): >>42552631 #>>42552645 #>>42553639 #
smolder No.42552631
Usually ones that are written in a slow language, do lots of IO to other web services or databases in a serial, blocking fashion, maybe don't have proper structure or indices in their DBs, and so on. I've seen some really terribly performing spaghetti websites, and have experience with them collapsing under scraping load. With a mountain of technical debt in the way, it can even be challenging to fix such a thing.
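
To make the serial-IO point concrete, here's a small self-contained sketch (invented 100 ms latencies, not numbers from this thread): five backend calls awaited one by one cost ~500 ms per page view, while the same calls issued together cost ~100 ms.

    import asyncio
    import time

    async def fetch(name: str) -> str:
        # Stand-in for a call to a database or downstream web service (~100 ms each).
        await asyncio.sleep(0.1)
        return f"{name}: ok"

    async def render_serial() -> list[str]:
        # The anti-pattern above: each call waits for the previous one (~500 ms total).
        return [await fetch(f"svc{i}") for i in range(5)]

    async def render_concurrent() -> list[str]:
        # Same dependencies issued together (~100 ms total).
        return await asyncio.gather(*(fetch(f"svc{i}") for i in range(5)))

    async def main() -> None:
        for render in (render_serial, render_concurrent):
            start = time.perf_counter()
            await render()
            print(f"{render.__name__}: {time.perf_counter() - start:.2f}s")

    asyncio.run(main())
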
replies(1): >>42553238 #
ndriscoll No.42553238
Even if you're doing serial IO on a single thread, I'd expect you to be able to handle hundreds of qps. I'd think a slow language wouldn't be 1000x slower than something like functional Scala. It could be slow if you're missing an index, but then I'd expect the thing to barely run for normal users; scraping at 2/s isn't really the issue there.
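
As an illustration of the missing-index case (a generic sketch with a made-up schema, not anything from this thread), SQLite's query planner shows the same lookup flipping from a full table scan to an index search once the index exists:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, slug TEXT, body TEXT)")

    def plan(query: str) -> str:
        # Ask SQLite how it intends to execute the statement.
        rows = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
        return "; ".join(row[-1] for row in rows)

    lookup = "SELECT body FROM posts WHERE slug = 'hello-world'"
    print("without index:", plan(lookup))  # SCAN posts -> cost grows with table size

    con.execute("CREATE INDEX idx_posts_slug ON posts (slug)")
    print("with index:   ", plan(lookup))  # SEARCH posts USING INDEX -> stays cheap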