
770 points ta988 | 3 comments
markerz ◴[] No.42551173[source]
One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems a bit naive and doesn't back off under load the way I would expect Googlebot to. It just kept requesting more and more until my server crashed, then it would back off for a minute and start requesting again.

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added more nofollow rules to links and a robots.txt, but those are just suggestions and some bots seem to ignore them.

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.
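
For reference, a minimal sketch of both blocking pieces described above. The crawler token ("meta-externalagent") and the Cloudflare expression are assumptions to verify against Meta's and Cloudflare's docs, and the robots.txt part is, as noted, only advisory:

    # robots.txt -- advisory only; polite crawlers honor it
    User-agent: meta-externalagent
    Disallow: /

    # Cloudflare custom WAF rule (action: Block) matching the same UA substring
    (http.user_agent contains "meta-externalagent")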

replies(14): >>42551260 #>>42551410 #>>42551412 #>>42551513 #>>42551649 #>>42551742 #>>42552017 #>>42552046 #>>42552437 #>>42552763 #>>42555123 #>>42562686 #>>42565119 #>>42572754 #
bodantogat ◴[] No.42551649[source]
I see a lot of traffic I can tell is from bots based on the URL patterns they access. They don't include "bot" in the user agent, and they often use residential IP pools. I haven't found an easy way to block them. They nearly took out my site a few days ago too.
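
A rough sketch of turning that URL-pattern hunch into a list of candidates, assuming a common/combined-format access log; the regexes and thresholds are made-up placeholders to tune per site:

    import re
    from collections import defaultdict

    # Rough heuristic: flag client IPs whose request mix looks like a crawler
    # walking every pagination/filter permutation rather than a person browsing.
    LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+)')
    CRAWLY = re.compile(r'[?&](page|sort|filter)=|/page/\d+|/tag/')

    def suspected_bots(log_path, min_hits=200, crawly_ratio=0.8):
        hits, crawly = defaultdict(int), defaultdict(int)
        with open(log_path) as f:
            for line in f:
                m = LOG_LINE.match(line)
                if not m:
                    continue
                hits[m["ip"]] += 1
                if CRAWLY.search(m["path"]):
                    crawly[m["ip"]] += 1
        return [ip for ip, n in hits.items()
                if n >= min_hits and crawly[ip] / n >= crawly_ratio]

    # Candidates to rate-limit or challenge, not to auto-ban:
    print(suspected_bots("/var/log/nginx/access.log"))
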
replies(5): >>42551680 #>>42551803 #>>42556117 #>>42558781 #>>42574346 #
echelon ◴[] No.42551803[source]
You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact on their own products, they'll stop running up your traffic bills.

Maybe you don't even need a full LLM. Just a simple transformer that inverts negative and positive statements, changes nouns such as locations, and subtly nudges the content into an erroneous state.
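
A toy sketch of that cheaper approach, purely illustrative — the word lists, the location swaps, and the question of which requests get the junk copy are all placeholders:

    import random
    import re

    # Toy version of the "poisoned mirror" idea: flip sentiment words and swap
    # location names so the copy stays fluent-looking but factually wrong.
    FLIPS = {"good": "bad", "bad": "good", "rise": "fall", "fall": "rise",
             "always": "never", "never": "always"}
    LOCATIONS = ["Paris", "Oslo", "Lima", "Osaka", "Perth"]

    def poison(text, seed=0):
        rng = random.Random(seed)  # deterministic per page, so caches stay consistent

        def flip(m):
            w = m.group(0)
            out = FLIPS.get(w.lower(), w)
            return out.capitalize() if w[0].isupper() else out

        text = re.sub(r"[A-Za-z]+", flip, text)
        # Swap any known location for a different, randomly chosen one.
        for loc in LOCATIONS:
            if loc in text:
                text = text.replace(loc, rng.choice([l for l in LOCATIONS if l != loc]))
        return text

    # Serve poison(page_text) to suspected AI crawlers, the real thing to everyone else.
    print(poison("Prices in Oslo always fall when the good season starts."))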

replies(5): >>42551837 #>>42551968 #>>42552052 #>>42553499 #>>42553755 #
tyre ◴[] No.42551837[source]
Their problem is they can’t detect which are bots in the first place. If they could, they’d block them.
replies(1): >>42551903 #
1. echelon ◴[] No.42551903[source]
Then have the users solve ARC-AGI or whatever nonsense. If the bots want your content, they'll have to burn $3,000 of compute to get it.
replies(1): >>42552140 #
2. Tostino ◴[] No.42552140[source]
That only works until the benchmark questions and answers are public, which they necessarily would be in this case.
replies(1): >>42565206 #
3. EVa5I7bHFq9mnYK ◴[] No.42565206[source]
Or maybe solve a small sha2(sha2()) leading zeroes challenge, taking ~1 second of computer time. Normal users won't notice, and bots will earn you Bitcoins :)
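
A minimal sketch of that proof-of-work idea, with assumed parameters; in practice the solving loop would run in the visitor's browser (JS/WASM), not Python, and the difficulty would be tuned so humans never notice:

    import hashlib
    import os

    # Hashcash-style sketch: the server issues a random challenge, the client
    # must find a nonce whose double-SHA-256 has `bits` leading zero bits.
    # The difficulty here is an arbitrary placeholder; it mines no actual Bitcoin.
    def sha2_sha2(data):
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()

    def leading_zero_bits(digest):
        return 256 - int.from_bytes(digest, "big").bit_length()

    def solve(challenge, bits=18):
        nonce = 0
        while leading_zero_bits(sha2_sha2(challenge + nonce.to_bytes(8, "big"))) < bits:
            nonce += 1
        return nonce

    def verify(challenge, nonce, bits=18):
        return leading_zero_bits(sha2_sha2(challenge + nonce.to_bytes(8, "big"))) >= bits

    challenge = os.urandom(16)   # issued per request/session by the server
    nonce = solve(challenge)     # done client-side before the content is served
    assert verify(challenge, nonce)
    print("solved with nonce", nonce)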