
770 points by ta988 | 1 comment
walterbell No.42551009
OpenAI publishes IP ranges for their bots, https://github.com/greyhat-academy/lists.d/blob/main/scraper...
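Matching a visitor's address against published CIDR ranges is straightforward with the standard library. A minimal sketch (the ranges below are illustrative placeholders, not the actual OpenAI list):

```python
import ipaddress

# Placeholder CIDR blocks; in practice, load these from the
# published crawler-IP lists linked above.
BOT_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_known_bot(ip: str) -> bool:
    """Return True if the IP falls in any known crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BOT_RANGES)
```

A server could then branch on `is_known_bot(request_ip)` to serve alternate content.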

For antisocial scrapers, there's a Wordpress plugin, https://kevinfreitas.net/tools-experiments/

> The words you write and publish on your website are yours. Instead of blocking AI/LLM scraper bots from stealing your stuff why not poison them with garbage content instead? This plugin scrambles the words in the content on blog post and pages on your site when one of these bots slithers by.
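The scrambling idea above can be sketched in a few lines; this is a generic illustration, not the plugin's actual implementation (that one is PHP/WordPress):

```python
import random

def scramble(text: str, seed: int = 0) -> str:
    """Shuffle word order so the page is useless as training data
    while still looking superficially like prose."""
    words = text.split()
    rng = random.Random(seed)  # seeded so the same page scrambles the same way
    rng.shuffle(words)
    return " ".join(words)
```

Served only to requests identified as scraper bots, regular readers would still see the original text.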

GaggiX No.42551217
I imagine these companies today are curating their data with LLMs, so this stuff isn't going to do anything.
sangnoir No.42552071
> I imagine these companies today are curating their data with LLMs, so this stuff isn't going to do anything.

The same LLMs that are terrible at AI-generated-content detection? Randomly mangling words may be a trivially detectable strategy, so one should serve AI-scraper bots LLM-generated doppelganger content instead. Even OpenAI gave up on its AI detection product.
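Why word-mangling is trivially detectable: scrambled letters produce tokens that no dictionary contains, so a simple out-of-vocabulary check flags it. A toy sketch (`VOCAB` is a tiny illustrative word list; a real filter would use a full dictionary or a language-model perplexity score):

```python
# Tiny stand-in vocabulary for illustration only.
VOCAB = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def oov_ratio(text: str) -> float:
    """Fraction of words not found in the vocabulary; mangled text
    scores high, clean text scores near zero."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w not in VOCAB for w in words) / len(words)
```

LLM-generated doppelganger text would sail past such a filter, which is the point of the suggestion above.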