
770 points ta988 | 1 comment | source
buro9 ◴[] No.42551470[source]
Their appetite cannot be sated, and there is little to no value in giving them access to the content.

I have data: 7 days of traffic from a single platform with about 30 forums on this instance.

4.8M hits from Claude
390k from Amazon
261k from Data For SEO
148k from ChatGPT

That Claude one! Wowser.

Bots that match this (which is also the list I block on some other forums that are fully private by default):

(?i).*(AhrefsBot|AI2Bot|AliyunSecBot|Amazonbot|Applebot|Awario|axios|Baiduspider|barkrowler|bingbot|BitSightBot|BLEXBot|Buck|Bytespider|CCBot|CensysInspect|ChatGPT-User|ClaudeBot|coccocbot|cohere-ai|DataForSeoBot|Diffbot|DotBot|ev-crawler|Expanse|FacebookBot|facebookexternalhit|FriendlyCrawler|Googlebot|GoogleOther|GPTBot|HeadlessChrome|ICC-Crawler|imagesift|img2dataset|InternetMeasurement|ISSCyberRiskCrawler|istellabot|magpie-crawler|Mediatoolkitbot|Meltwater|Meta-External|MJ12bot|moatbot|ModatScanner|MojeekBot|OAI-SearchBot|Odin|omgili|panscient|PanguBot|peer39_crawler|Perplexity|PetalBot|Pinterestbot|PiplBot|Protopage|scoop|Scrapy|Screaming|SeekportBot|Seekr|SemrushBot|SeznamBot|Sidetrade|Sogou|SurdotlyBot|Timpibot|trendictionbot|VelenPublicWebCrawler|WhatsApp|wpbot|xfa1|Yandex|Yeti|YouBot|zgrab|ZoominfoBot).*

I am moving to just blocking them all; it's ridiculous.

Everything on this list got itself there by being abusive (either ignoring robots.txt, or not backing off when latency increased).
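For the "not backing off when latency increased" part, the usual first resort before outright blocking is per-IP rate limiting. A minimal nginx limit_req sketch (zone name and rates are illustrative; and as noted further down the thread, crawlers that rotate IPs sidestep per-IP limits):

    # http {} context: track clients by IP, allow roughly 1 request/second each
    limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

    server {
        listen 80;
        location / {
            # absorb short bursts, reject sustained overruns with 429
            limit_req zone=perip burst=20 nodelay;
            limit_req_status 429;
        }
    }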

replies(9): >>42551536 #>>42551667 #>>42551719 #>>42551781 #>>42551798 #>>42551877 #>>42552584 #>>42552786 #>>42565241 #
pogue ◴[] No.42551667[source]
What do you use to block them?
replies(1): >>42551696 #
buro9 ◴[] No.42551696[source]
Nginx; it's nothing special, it's just my load balancer.

    # ~* is a case-insensitive regex match on the User-Agent header
    if ($http_user_agent ~* "(list|of|case|insensitive|things|to|block)") { return 403; }
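For a list this long, the more idiomatic nginx shape is a map in the http block, which keeps the if trivial. A minimal sketch, with the variable name illustrative and the list abridged to a few entries from above:

    # http {} context: flag any UA matching the block list (~* = case-insensitive)
    map $http_user_agent $blocked_ua {
        default 0;
        "~*(AhrefsBot|Bytespider|ClaudeBot|GPTBot|Perplexity)" 1;
    }

    server {
        listen 80;
        # reject flagged crawlers before any other processing
        if ($blocked_ua) {
            return 403;
        }
    }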

replies(2): >>42552020 #>>42555075 #
gs17 ◴[] No.42552020[source]
From the article:

> If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

It would be interesting if you had any data about this, since you seem like you would notice who behaves "better" and who tries every trick to get around blocks.

replies(1): >>42564061 #
Libcat99 ◴[] No.42564061[source]
Switching to sending wrong, inexpensive data might be preferable to blocking them.

I've used this with VoIP scanners.
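In nginx that can be as simple as answering matched user agents with a tiny canned page instead of a 403. A sketch (UA list abridged, payload illustrative):

    location / {
        default_type text/html;
        # cheap to serve, worthless to scrape or train on
        if ($http_user_agent ~* "(ClaudeBot|GPTBot|Bytespider)") {
            return 200 "<html><body><p>Nothing to see here.</p></body></html>";
        }
        # normal handling for everyone else continues below
    }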

replies(1): >>42564776 #
buro9 ◴[] No.42564776[source]
Oh, I did this with the Facebook one and redirected it to a 100MB file of garbage that is part of the Cloudflare speed test... they hit it so many times that it would have been 2PB sent in a matter of hours.
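The redirect itself is a one-liner in the same nginx setup. A sketch of the shape of it (the UA and target URL are illustrative, not necessarily the exact ones used here):

    # bounce the crawler to a large file hosted on someone else's (consenting) CDN
    if ($http_user_agent ~* "facebookexternalhit") {
        return 302 https://speed.cloudflare.com/__down?bytes=104857600;  # ~100MB
    }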

I contacted the network team at Cloudflare to apologise, and also to confirm whether Facebook did actually follow the redirect. It's hard for Cloudflare to see 2PB; that kind of number is too small on a global scale when it occurs over a few hours. But given that only a single PoP would have handled it, it would have been visible.

It was not visible, which means we can conclude that Facebook were not following redirects; or, if they were, they were queuing them for later and would only hit the file once rather than repeatedly.

replies(1): >>42572442 #
tliltocatl ◴[] No.42572442[source]
Hmm, what about 1kB of carefully crafted gzip bomb? Or a TCP tarpit (that one would be a bit more difficult to deploy)?
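On the gzip bomb idea: plain DEFLATE tops out at roughly 1032:1, so 1kB of gzip only inflates to about 1MB; a real bomb needs a bigger compressed file (10GB of zeros gzips down to around 10MB). Serving one from nginx is simple, though. A sketch with illustrative paths, assuming bomb.gz was prepared offline:

    # serve a pre-compressed payload as ordinary gzipped HTML;
    # a well-behaved client will try to inflate the lot in memory
    location = /trap.html {
        gzip off;                          # don't re-compress it
        default_type text/html;
        add_header Content-Encoding gzip;
        alias /var/www/traps/bomb.gz;      # e.g. 10GB of zeros -> ~10MB .gz
    }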