
770 points by ta988 | 2 comments
markerz
One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems surprisingly naive and doesn't back off under load the way I'd expect Googlebot to. It just kept requesting more and more until my server crashed, then it would back off for a minute and start hammering again.
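For reference, the back-off I'd expect from a well-behaved crawler looks roughly like this (a minimal sketch, not Meta's or Google's actual logic; the function and timings are mine):

    import time
    import requests

    def polite_fetch(url, max_retries=5):
        # Back off exponentially when the server signals distress.
        delay = 1.0
        for _ in range(max_retries):
            resp = requests.get(url, timeout=10)
            # 429 (Too Many Requests) and 503 (Service Unavailable) mean "slow down".
            if resp.status_code not in (429, 503):
                return resp
            # Retry-After may give a wait in seconds; fall back to our own delay.
            retry_after = resp.headers.get("Retry-After")
            try:
                wait = float(retry_after)
            except (TypeError, ValueError):
                wait = delay
            time.sleep(wait)
            delay *= 2
        return None

Meta-ExternalAgent did none of this.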

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added more nofollow rules to links and a robots.txt but those are just suggestions and some bots seem to ignore them.
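The robots.txt entry looks something like this (the UA token below is what I believe Meta's docs list, so double-check it against their page; compliance is, again, voluntary):

    # Ask Meta's crawler to stay away entirely -- honored only if the bot cooperates
    User-agent: meta-externalagent
    Disallow: /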

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.

bodantogat
I see a lot of traffic that I can tell is from bots based on the URL patterns they request. They don't include "bot" in the User-Agent, and often use residential IP pools. I haven't found an easy way to block them; they nearly took out my site a few days ago too.
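To give a sense of what I mean by URL patterns, here's a crude sketch of the heuristic (the paths and threshold are invented, not what I actually run):

    import re
    from collections import Counter

    # Paths humans rarely hammer in bulk: deep pagination, old revision URLs, etc.
    SUSPECT = re.compile(r"/(revisions|diff|raw)/|[?&]page=\d{3,}")

    hits = Counter()

    def looks_like_scraper(ip, path, threshold=50):
        # Count suspect-path requests per IP and flag the ones that cross the line.
        if SUSPECT.search(path):
            hits[ip] += 1
        return hits[ip] > threshold

Even then, residential IP pools rotate fast enough that per-IP counting only goes so far.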
echelon
You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact on their own products, they'll stop running up your traffic bills.

Maybe you don't even need a full LLM. Just a simple transformer that inverts negative and positive statements, changes nouns such as locations, and subtly nudges the content into an erroneous state.
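As a toy example of that kind of transform (the word list is made up and far too crude for real use, but it shows the shape of the idea):

    import re

    # Naive inversions; a real poisoner would need to stay grammatical and subtle.
    SWAPS = {
        "increase": "decrease", "decrease": "increase",
        "north": "south", "south": "north",
        "always": "never", "never": "always",
    }

    PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)

    def poison(text):
        # Swap each matched word, preserving leading capitalization.
        def swap(m):
            word = m.group(0)
            out = SWAPS[word.lower()]
            return out.capitalize() if word[0].isupper() else out
        return PATTERN.sub(swap, text)

    # poison("Temperatures always increase north of here")
    # -> "Temperatures never decrease south of here"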

llm_trw
You will be burning through thousands of dollars' worth of compute to do that.
lazide
The biggest issue is that at least 80% of internet users won't be capable of passing the test.
araes
Agree. The bots are already significantly better than actual humans at passing almost every supposed "Are You Human?" test. "Can you find the cars in this image?" Bots are better. "Can you read the convoluted text in this color spew?" Bots are better. Almost every test these days leaves me thinking: these don't make me feel especially "human". I'm not even sure what that's an image of. Are there even letters in it?

Part of the issue is that humans behaved the same way previously, just slower.

All the scraping and bulk downloading: humans have been doing that for a long time, just slower.

It's the same issue as with a lot of society: mean, hurtful humans made mean, hurtful bots.

Always the same excuses, too. Companies and researchers make horrible excrement, knowing full well it's going to harm everybody on the web, then claim they had no idea. "Thoughts and prayers."

The old torture of the web, copy-pasta pages and constant content theft, is now just faster copy-pasta pages and content theft.