
770 points ta988 | 24 comments
markerz ◴[] No.42551173[source]
One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems a bit naive for some reason and doesn't back off under load the way I would expect from Googlebot. It just kept repeatedly requesting more and more until my server crashed, then it would back off for a minute and request more again.

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added more nofollow rules to links and a robots.txt, but those are just suggestions and some bots seem to ignore them.

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.
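For reference, the two layers look roughly like this (a sketch, not necessarily the exact config; it assumes the crawler identifies itself with a UA containing "meta-externalagent" per Meta's docs, and uses Cloudflare's custom-rule expression syntax):

    # robots.txt -- only a request; well-behaved crawlers honor it
    User-agent: meta-externalagent
    Disallow: /

    # Cloudflare custom rule (action: Block), matching the UA regardless of case
    lower(http.user_agent) contains "meta-externalagent"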

replies(14): >>42551260 #>>42551410 #>>42551412 #>>42551513 #>>42551649 #>>42551742 #>>42552017 #>>42552046 #>>42552437 #>>42552763 #>>42555123 #>>42562686 #>>42565119 #>>42572754 #
1. jsheard ◴[] No.42551599[source]
That's right, getting DDOSed is a skill issue. Just have infinite capacity.
replies(1): >>42551648 #
2. devit ◴[] No.42551648[source]
DDOS is different from crashing.

And I doubt Facebook implemented something that actually saturates the network; usually a scraper puts a limit on concurrent connections and often also adds a delay between requests (e.g. max 10 concurrent, 100 ms delay).

Chances are the website operator implemented a webserver with terrible RAM efficiency that runs out of RAM and crashes after 10 concurrent requests, or that saturates the CPU from simple requests, or something like that.
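For a sense of what the scraper-side throttling mentioned above amounts to, here is a minimal sketch (assuming Python with aiohttp; the 10-connection cap and 100 ms delay are just the example numbers from above):

    import asyncio
    import aiohttp

    MAX_CONCURRENT = 10  # cap on in-flight requests
    DELAY = 0.1          # 100 ms spacing before each request

    async def fetch(session, sem, url):
        async with sem:                 # at most MAX_CONCURRENT at once
            await asyncio.sleep(DELAY)  # spread requests out over time
            async with session.get(url) as resp:
                return url, resp.status, len(await resp.read())

    async def crawl(urls):
        sem = asyncio.Semaphore(MAX_CONCURRENT)
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

    # asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))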

replies(2): >>42551678 #>>42553042 #
3. adamtulinius ◴[] No.42551661[source]
No normal person has a chance against the capacity of a company like Facebook.
replies(1): >>42551850 #
4. aftbit ◴[] No.42551670[source]
Yeah, this is the sort of thing that a caching and rate-limiting load balancer (e.g. nginx) could very trivially mitigate. Just add a request-limit bucket keyed on the Meta User-Agent allowing at most 1 qps or whatever (tune it to ~20% of your backend capacity), returning 429 when exceeded.
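A minimal sketch of that rule in nginx config (assuming the bot's UA contains "meta-externalagent"; the numbers are placeholders to tune):

    # http{} context: bucket only requests whose UA matches the bot
    map $http_user_agent $meta_bot {
        default              "";
        ~*meta-externalagent metabot;
    }
    limit_req_zone $meta_bot zone=metabot:1m rate=1r/s;

    server {
        location / {
            limit_req zone=metabot burst=5 nodelay;  # empty key = not limited
            limit_req_status 429;
            # ... usual proxy_pass / fastcgi_pass config ...
        }
    }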

Of course Cloudflare can do all of this for you, and they functionally have unlimited capacity.

replies(1): >>42551973 #
5. adamtulinius ◴[] No.42551678{3}[source]
You can doubt all you want, but none of us really know, so maybe you could consider interpreting people's posts a bit more generously in 2025.
6. markerz ◴[] No.42551701[source]
Can't every webserver crash due to being overloaded? There's an upper limit to the performance of everything. My website is a hobby and runs on a $4/mo budget VPS.

Perhaps I'm saying crash and you're interpreting that as a bug, but really it's just an OOM issue because of too many in-flight requests. IDK, I don't care enough to handle serving my website at Facebook's scale.

replies(2): >>42551869 #>>42551889 #
7. Aeolun ◴[] No.42551850[source]
Anyone can send 10k concurrent requests with no more than their mobile phone.
8. iamacyborg ◴[] No.42551869[source]
I suspect if the tables were turned and someone managed to crash FB consistently they might not take too kindly to that.
9. ndriscoll ◴[] No.42551889[source]
I wouldn't expect it to crash in any case, but I'd generally expect that even an N100 mini PC should bottleneck on the network long before you manage to saturate CPU/RAM (maybe if you had 10Gbit you could do it). The linked post indicates they're getting ~2 requests/second from bots, which might as well be zero. Even low-powered modern hardware can do thousands to tens of thousands of requests per second.
replies(1): >>42552279 #
10. layer8 ◴[] No.42551946[source]
The alternative of crawling to a stop isn’t really an improvement.
11. layer8 ◴[] No.42551973[source]
Read the article: the bots change their User-Agent to an innocuous one when they start being blocked.

And having to use Cloudflare is just as bad for the internet as a whole as bots routinely eating up all available resources.

replies(1): >>42568145 #
12. troupo ◴[] No.42552279{3}[source]
You completely ignore the fact that they are also requesting a lot of pages that can be expensive to retrieve/calculate.
replies(1): >>42552510 #
13. ndriscoll ◴[] No.42552510{4}[source]
Beyond something like running an ML model, what web pages are expensive to generate these days (expensive enough that 1-10 requests/second matters at all)?
replies(3): >>42552631 #>>42552645 #>>42553639 #
14. smolder ◴[] No.42552631{5}[source]
Usually ones that are written in a slow language, do lots of IO to other web services or databases in a serial, blocking fashion, maybe don't have proper structure or indices in their DBs, and so on. I have seen some really terribly performing spaghetti websites, and have experience with them collapsing under scraping load. With a mountain of technical debt in the way, it can even be challenging to fix such a thing.
replies(1): >>42553238 #
15. troupo ◴[] No.42552645{5}[source]
Run a MediaWiki, as described in the post. It's very heavy. Specifically for history, I'm guessing it has to re-parse the entire page and do all the link and template lookups, because previous versions of the page won't be in any cache.
replies(1): >>42552696 #
16. ndriscoll ◴[] No.42552696{6}[source]
The original post says it's not actually a burden though; they just don't like it.

If something is so heavy that 2 requests/second matters, it would've been completely infeasible in, say, 2005 (e.g. a low-power N100 is ~20x faster than the Athlon XP 3200+ I used back then. An i5-12600 is almost 100x faster. Storage is >1000x faster now). Or has MediaWiki been getting less efficient over the years to keep up with more powerful hardware?

replies(1): >>42553809 #
17. atomt ◴[] No.42553042{3}[source]
I've seen concurrency in excess of 500 from Meta's crawlers to a single site. That site had just moved all their images, so all the requests hit the "pretty URL" rewrite into a slow dynamic request handler. It did not go very well.
18. ndriscoll ◴[] No.42553238{6}[source]
Even if you're doing serial IO on a single thread, I'd expect you should be able to handle hundreds of qps. I'd think a slow language wouldn't be 1000x slower than something like functional Scala. It could be slow if you're missing an index, but then I'd expect the thing to barely run for normal users; scraping at 2/s isn't really the issue there.
19. x0x0 ◴[] No.42553639{5}[source]
I've worked on multiple sites like this over my career.

Our pages were expensive to generate, so what scraping did was blow out all our caches by yanking cold pages/images into memory. Page caches, fragment caches, image caches, but also the DB working set in RAM, making every single thing on the site slow.

replies(1): >>42556708 #
20. troupo ◴[] No.42553809{7}[source]
Oh, I was a bit off. They also indexed diffs:

> And I mean that - they indexed every single diff on every page for every change ever made. Frequently with spikes of more than 10req/s. Of course, this made MediaWiki and my database server very unhappy, causing load spikes, and effective downtime/slowness for the human users.

replies(1): >>42554087 #
21. ndriscoll ◴[] No.42554087{8}[source]
Does MW not store diffs as diffs (I'd think it would for storage efficiency)? That shouldn't really require much computation. Did diffs take 30s+ to render 15-20 years ago?

For what it's worth, my kiwix copy of Wikipedia has a ~5ms response time for an uncached article according to Firefox. If I hit a single URL with wrk (so at least some disk caching; I don't know what else kiwix might do) at concurrency 8, it does 13k rps on my N305 with a 500 µs average response time. That's over 20Gbit/s, so basically impossible to actually saturate. If I load test from another computer, it uses ~0.2 cores to max out 1Gbit/s. Different code bases, and presumably kiwix is a bit more static, but it at least provides a little context to compare with for orders of magnitude. A 3 OOM difference seems pretty extreme.
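(For context, that sort of test is just a one-liner along these lines; the host, port, and article path are placeholders, and the thread count is arbitrary:)

    wrk -t2 -c8 -d30s --latency http://localhost:8080/A/Some_Article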

Incidentally, local copies of things are pretty great. It really makes you notice how slow the web is when links open in like 1 frame.

replies(1): >>42557361 #
22. ◴[] No.42556708{6}[source]
23. troupo ◴[] No.42557361{9}[source]
> Different code bases

Indeed ;)

> If I hit a single URL with wrk

But the bots aren't hitting a single URL

As for the diffs...

According to MediaWiki's docs, it gzips diffs [1]. So to show a previous version of the page, I guess it'd have to unzip and apply all the diffs in sequence to reconstruct that version.

And then it depends on how efficient the queries are at fetching them, etc.

[1] https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture

24. aftbit ◴[] No.42568145{3}[source]
I did read the article. I'm skeptical of the claim, though. The author was careful to publish specific UAs for the bots, but then provided no extra information about the non-bot UAs.

>If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

I'm also skeptical of the need for _anyone_ to access the edit history at 10 qps. You could put an nginx rule on those routes that just limits the edit history pages to 0.5 qps per IP and 2 qps across all IPs, which would protect your site from both bad AI bots and dumb MediaWiki script kiddies with little impact.

>Oh, and of course, they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not.

And caching would fix this too, especially for pages that are guaranteed not to change (e.g. an edit history diff page).
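A sketch of both of those suggestions (rate-limiting the history routes, with a note on caching) for a MediaWiki-style setup, assuming history and diff views go through index.php with action=history, diff= or oldid= in the query string; the numbers mirror the ones above:

    # http{} context: key the buckets only for history/diff requests
    map $request_uri $hist_ip_key  { default ""; "~(action=history|diff=|oldid=)" $binary_remote_addr; }
    map $request_uri $hist_all_key { default ""; "~(action=history|diff=|oldid=)" all; }
    limit_req_zone $hist_ip_key  zone=hist_ip:10m rate=30r/m;  # ~0.5 qps per IP
    limit_req_zone $hist_all_key zone=hist_all:1m rate=2r/s;   # 2 qps total

    server {
        location = /index.php {
            limit_req zone=hist_ip  burst=3 nodelay;
            limit_req zone=hist_all burst=4 nodelay;
            limit_req_status 429;
            # Old revisions and diffs never change, so a fastcgi_cache/proxy_cache
            # (or MediaWiki's own caching) in front of these URLs also helps.
            # ... usual MediaWiki PHP handler config ...
        }
    }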

Don't get me wrong, I'm not unsympathetic to the author's plight, but I do think that the internet is an unsafe place full of bad actors, and a single bad actor can easily cause a lot of harm. I don't think throwing up your hands and complaining is that helpful. Instead, just apply the mitigations that have existed for this for at least 15 years and move on with your life. Your visitors will be happier and the bots will get boned.