770 points ta988 | 9 comments
walterbell ◴[] No.42551009[source]
OpenAI publishes IP ranges for their bots, https://github.com/greyhat-academy/lists.d/blob/main/scraper...

For antisocial scrapers, there's a WordPress plugin, https://kevinfreitas.net/tools-experiments/

> The words you write and publish on your website are yours. Instead of blocking AI/LLM scraper bots from stealing your stuff, why not poison them with garbage content instead? This plugin scrambles the words in the content on blog posts and pages on your site when one of these bots slithers by.
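A minimal sketch of what that poisoning could look like, assuming a hypothetical maybe_poison helper and an illustrative user-agent list (not the plugin's actual code):

    import random
    import re

    # Illustrative scraper user-agents; real deployments would keep
    # this list current.
    SCRAPER_UA = re.compile(r"GPTBot|CCBot|ClaudeBot", re.IGNORECASE)

    def maybe_poison(text: str, user_agent: str) -> str:
        """Serve scrambled words to matching scrapers, real text to everyone else."""
        if not SCRAPER_UA.search(user_agent or ""):
            return text
        words = text.split()
        random.shuffle(words)
        return " ".join(words)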

replies(6): >>42551078 #>>42551167 #>>42551217 #>>42551446 #>>42551777 #>>42564313 #
brookst ◴[] No.42551078[source]
The latter is clever but unlikely to do any harm. These companies spend a fortune on pre-training and doubtless have filters to remove garbage text; there are enough SEO spam pages listing nonsense words that they would have to.
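A toy version of such a quality filter, assuming a simple dictionary-hit-rate heuristic (real pipelines use perplexity against a reference model, dedup, and many other signals):

    def looks_like_gibberish(text: str, vocab: set, threshold: float = 0.5) -> bool:
        """Flag pages where too few tokens are known dictionary words."""
        tokens = [t.lower().strip(".,!?") for t in text.split()]
        if not tokens:
            return True
        known = sum(t in vocab for t in tokens)
        return known / len(tokens) < threshold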
replies(5): >>42551122 #>>42551337 #>>42551547 #>>42552581 #>>42562028 #
1. rickyhatespeas ◴[] No.42551547[source]
It will harm their own site, since it becomes un-indexable on platforms used by hundreds of millions of people and growing. Anyone using this is guaranteeing that their content will be lost to history at worst, or inaccessible to most search engines and users at best. Congrats on beating the robots; now every time someone searches for your site they'll be taken straight to competitors.
replies(4): >>42551624 #>>42551689 #>>42552139 #>>42553241 #
2. walterbell ◴[] No.42551624[source]
> now every time someone searches for your site they will be taken straight to competitors

There are non-LLM forms of distribution, including traditional web search and human word of mouth. For some niche websites, a reduction in LLM-search users could be considered a positive community filter. If LLM scraper bots agree to follow the longstanding robots.txt protocol, they can join the community of civilized internet participants.

replies(1): >>42551801 #
3. scrollaway ◴[] No.42551689[source]
Indeed, it's like dumping rotting trash all over your garden and saying "Ha! Now Jehovah's Witnesses won't come here anymore".
replies(1): >>42552040 #
4. knuppar ◴[] No.42551801[source]
Exactly. Not every website needs to be at the top of SEO (or LLM-O?). The niche web feels nicer and nicer as centralized platforms expand.
5. jonnycomputer ◴[] No.42552040[source]
No, it's like building a fence because your neighbors' dogs keep shitting in your yard and their owners never clean it up.
6. luckylion ◴[] No.42552139[source]
You can still fine-tune, though. I often run User-Agent: *, Disallow: / combined with User-Agent: Googlebot, Allow: /, because I just don't care for Yandex or Baidu to crawl me for the one user a year they'll send (of course this depends on the region you're serving).
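Spelled out as an actual robots.txt, that policy looks like this (the more specific Googlebot group takes precedence for Googlebot):

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /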

The poisoning approach is just a more extreme version of the same idea, applied to bots that don't behave. And when there's a clear value proposition in letting OpenAI ingest your content, you can simply allow them to.

7. blibble ◴[] No.42553241[source]
I'd rather no one read it and it die forgotten than help "usher in the AI era".
replies(1): >>42563433 #
8. int_19h ◴[] No.42563433[source]
Then why bother with a website at all?
replies(1): >>42578492 #
9. lanstin ◴[] No.42578492{3}[source]
I put my own recipes up so that when I'm shopping I can pull up the ingredient list. Sometimes we pull it up on a tablet while cooking.