
770 points ta988 | 10 comments
markerz ◴[] No.42551173[source]
One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems a bit naive and doesn't do performance back-off the way I would expect from Googlebot. It just kept requesting more and more until my server crashed, then it would back off for a minute and start requesting again.

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added more nofollow attributes to links and a robots.txt, but those are just suggestions and some bots seem to ignore them.
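For anyone not fronted by Cloudflare, the same User-Agent block can be done at the application layer. A minimal sketch as WSGI middleware (the Meta-ExternalAgent token comes from Meta's docs above; the middleware itself is just illustrative, not what I actually run):

    # Illustrative only: return 403 for any request whose User-Agent contains
    # a blocked crawler token. Wrap your existing WSGI app with it.
    BLOCKED_UA_TOKENS = ("meta-externalagent",)

    def block_bad_bots(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(token in ua for token in BLOCKED_UA_TOKENS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"blocked"]
            return app(environ, start_response)
        return middleware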

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.

replies(14): >>42551260 #>>42551410 #>>42551412 #>>42551513 #>>42551649 #>>42551742 #>>42552017 #>>42552046 #>>42552437 #>>42552763 #>>42555123 #>>42562686 #>>42565119 #>>42572754 #
devit[dead post] ◴[] No.42551513[source]
[flagged]
markerz ◴[] No.42551701[source]
Can't every webserver crash from being overloaded? There's an upper limit to the performance of everything. My website is a hobby and runs on a $4/mo budget VPS.

Perhaps I'm saying 'crash' and you're interpreting that as a bug, but really it's just an OOM issue caused by too many in-flight requests. IDK, I don't care enough to handle serving my website at Facebook's scale.
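For what it's worth, the usual way to keep a tiny VPS from OOMing under a burst like this is to cap in-flight requests and shed the rest; a rough asyncio sketch (the limit and the handler shape are made up, not my actual setup):

    import asyncio

    MAX_IN_FLIGHT = 32                     # made-up cap for a $4/mo box
    slots = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def handle(render_page):         # render_page: whatever async work builds the response
        if slots.locked():                 # every slot busy: shed load instead of buffering
            return 503, b"busy, try again later"
        async with slots:                  # otherwise hold a slot for the duration
            return 200, await render_page()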

replies(2): >>42551869 #>>42551889 #
ndriscoll ◴[] No.42551889[source]
I wouldn't expect it to crash in any case, but I'd generally expect that even an N100 mini PC should bottleneck on the network long before you manage to saturate CPU/RAM (maybe with 10Gbit you could do it). The linked post indicates they're getting ~2 requests/second from bots, which might as well be zero. Even low-powered modern hardware can handle thousands to tens of thousands of requests per second.
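Back-of-envelope for the network claim, assuming a 1 Gbit/s uplink and ~50 KB per HTML response (both assumed numbers, not from the post):

    link_bits_per_s = 1_000_000_000          # assumed 1 Gbit/s uplink
    page_bytes = 50 * 1024                   # assumed ~50 KB per HTML response
    print(link_bits_per_s / 8 / page_bytes)  # ~2400 responses/s before the NIC is full

So for pages that are cheap to render, the NIC is the ceiling well before the CPU is.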
replies(1): >>42552279 #
troupo ◴[] No.42552279[source]
You completely ignore the fact that they are also requesting a lot of pages that can be expensive to retrieve/calculate.
replies(1): >>42552510 #
1. ndriscoll ◴[] No.42552510[source]
Beyond something like running an ML model, what web pages are expensive to generate these days (expensive enough that 1-10 requests/second matters at all)?
replies(3): >>42552631 #>>42552645 #>>42553639 #
2. smolder ◴[] No.42552631[source]
Usually ones that are written in a slow language, do lots of IO to other web services or databases in a serial, blocking fashion, maybe don't have proper structure or indices in their DBs, and so on. I've seen some really terribly performing spaghetti websites and have experience with them collapsing under scraping load. With a mountain of technical debt in the way, it can even be challenging to fix such a thing.
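As a toy illustration of the serial-IO part (hypothetical calls, stdlib only): three dependent 50 ms lookups cap a worker at ~6 pages/s, while issuing them together brings it back to ~20.

    import asyncio

    async def fetch(name, delay=0.05):       # stand-in for a DB/webservice call
        await asyncio.sleep(delay)
        return name

    async def page_serial(uid):
        # one call after another: ~150 ms per page
        return (await fetch("user"), await fetch("orders"), await fetch("recs"))

    async def page_concurrent(uid):
        # same IO issued together: ~50 ms per page
        return tuple(await asyncio.gather(fetch("user"), fetch("orders"), fetch("recs")))

    asyncio.run(page_concurrent(42))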
replies(1): >>42553238 #
3. troupo ◴[] No.42552645[source]
Run a MediaWiki, as described in the post. It's very heavy. For history specifically, I'm guessing it has to re-parse the entire page and do all the link and template lookups, because previous versions of the page won't be in any cache.
replies(1): >>42552696 #
4. ndriscoll ◴[] No.42552696[source]
The original post says it's not actually a burden though; they just don't like it.

If something is so heavy that 2 requests/second matters, it would've been completely infeasible in, say, 2005 (e.g. a low-power N100 is ~20x faster than the Athlon XP 3200+ I used back then, an i5-12600 is almost 100x faster, and storage is >1000x faster now). Or has MediaWiki been getting less efficient over the years to keep up with more powerful hardware?
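Spelling out the arithmetic behind that (using the ~20x figure above; the 0.5 s/page is just the implication of 2 req/s fully saturating one modern box):

    modern_cpu_per_page = 1 / 2      # seconds of CPU per page if 2 req/s saturated a modern box
    speedup_vs_2005 = 20             # low-power N100 vs. Athlon XP 3200+, per the estimate above
    print(modern_cpu_per_page * speedup_vs_2005)   # ~10 s per page render on 2005 hardware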

replies(1): >>42553809 #
5. ndriscoll ◴[] No.42553238[source]
Even if you're doing serial IO on a single thread, I'd expect you to be able to handle hundreds of qps. I'd think a slow language wouldn't be 1000x slower than something like functional Scala. It could be slow if you're missing an index, but then I'd expect the thing to barely run for normal users; scraping at 2/s isn't really the issue there.
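On the missing-index point, a toy stdlib-only demo (table shape and row counts invented) of how one absent index turns a common query into a full scan:

    import sqlite3, time

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE revisions (page_id INTEGER, rev_id INTEGER, body TEXT)")
    db.executemany("INSERT INTO revisions VALUES (?, ?, 'x')",
                   ((i % 1000, i) for i in range(200_000)))

    def query_ms():
        t = time.perf_counter()
        db.execute("SELECT rev_id FROM revisions WHERE page_id = ?", (42,)).fetchall()
        return (time.perf_counter() - t) * 1000

    print(f"full scan: {query_ms():.1f} ms")
    db.execute("CREATE INDEX idx_page ON revisions(page_id)")
    print(f"indexed:   {query_ms():.2f} ms")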
6. x0x0 ◴[] No.42553639[source]
I've worked on multiple sites like this over my career.

Our pages were expensive to generate, so what scraping did was blow out all our caches by yanking cold pages and images into memory: page caches, fragment caches, image caches, but also the DB working set in RAM, making every single thing on the site slow.
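A toy model of that effect (all numbers invented): a small LRU keeps a high hit rate on the hot set until a scraper starts walking the long tail alongside normal traffic, and the hot entries get evicted faster than they're reused.

    import random
    from collections import OrderedDict

    class LRU(OrderedDict):
        def __init__(self, size):
            super().__init__()
            self.size = size
        def get_page(self, key):
            hit = key in self
            if hit:
                self.move_to_end(key)
            else:
                self[key] = "rendered"          # pretend this was expensive to build
                if len(self) > self.size:
                    self.popitem(last=False)    # evict least-recently-used
            return hit

    def hit_rate(scraper_reqs_per_user_req):
        cache, hits = LRU(1000), 0
        for _ in range(20_000):
            hits += cache.get_page(("hot", random.randint(0, 500)))   # normal traffic
            for _ in range(scraper_reqs_per_user_req):
                cache.get_page(("cold", random.random()))             # bot walking the long tail
        return hits / 20_000

    print(hit_rate(0))    # ~0.97: the hot set fits and the cache does its job
    print(hit_rate(10))   # far lower: cold pages evict the hot set before it's reused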

replies(1): >>42556708 #
7. troupo ◴[] No.42553809{3}[source]
Oh, I was a bit off. They also indexed diffs:

> And I mean that - they indexed every single diff on every page for every change ever made. Frequently with spikes of more than 10req/s. Of course, this made MediaWiki and my database server very unhappy, causing load spikes, and effective downtime/slowness for the human users.

replies(1): >>42554087 #
8. ndriscoll ◴[] No.42554087{4}[source]
Does MW not store diffs as diffs (I'd think it would for storage efficiency)? That shouldn't really require much computation. Did diffs take 30s+ to render 15-20 years ago?

For what it's worth, my Kiwix copy of Wikipedia has a ~5 ms response time for an uncached article according to Firefox. If I hit a single URL with wrk at concurrency 8 (so some caching, at least by the disks; I don't know what else Kiwix might do), it does 13k rps on my N305 with a 500 µs average response time. That's over 20 Gbit/s, so basically impossible to actually saturate. If I load test from another computer, it uses ~0.2 cores to max out 1 Gbit/s. Different code bases, and presumably Kiwix is a bit more static, but it at least provides some context for orders of magnitude. A 3-OOM difference seems pretty extreme.

Incidentally, local copies of things are pretty great. It really makes you notice how slow the web is when links open in like 1 frame.

replies(1): >>42557361 #
9. ◴[] No.42556708[source]
10. troupo ◴[] No.42557361{5}[source]
> Different code bases

Indeed ;)

> If I hit a single URL with wrk

But the bots aren't hitting a single URL

As for the diffs...

According to the MediaWiki docs, it gzips diffs [1]. So to render a previous version of a page, I guess it'd have to unzip and apply all the diffs in sequence to reconstruct that version.

And then it depends on how efficient the queries are at fetching etc.
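To make that cost concrete, here's a toy model (not MediaWiki's actual schema, which varies with configuration): revision 0 stored in full, each later revision as a gzipped delta against its predecessor, so materializing revision N means decompressing and applying N deltas.

    import difflib, json, zlib

    def make_delta(old, new):
        # keep only the changed spans (old-index ranges + replacement text), gzipped
        ops = [(i1, i2, new[j1:j2])
               for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes()
               if tag != "equal"]
        return zlib.compress(json.dumps(ops).encode())

    def apply_delta(old, blob):
        out, pos = [], 0
        for i1, i2, repl in json.loads(zlib.decompress(blob)):
            out.append(old[pos:i1])     # unchanged text up to the edit
            out.append(repl)            # the replacement (empty for deletions)
            pos = i2
        out.append(old[pos:])
        return "".join(out)

    def load_revision(base_text, deltas, n):
        # the expensive part: every old revision is a full walk of the chain
        text = base_text
        for blob in deltas[:n]:
            text = apply_delta(text, blob)
        return text

    # load_revision("v0", [make_delta("v0", "v1"), make_delta("v1", "v2")], 2) == "v2"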

[1] https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture