TIPSIO No.45066555
Everyone loves the dream of a free-for-all, open web.

But the reality is: how can someone small protect their blog or content from AI training bots? Are they supposed to blindly trust that a crawler really is an agent rather than a training bot, and that it's super duper respecting robots.txt? Get real...

Or fine, say they do respect robots.txt. What stops them from buying the same data anyway, laundered through liability layers as "licensed data"?

Unless you're Reddit, X, Google, or Meta, with scarily unlimited legal budgets, you have no power.
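
To make it concrete: the strongest tool a small site has is a purely advisory robots.txt. A sketch along these lines (GPTBot is OpenAI's crawler token, CCBot is Common Crawl's, Google-Extended is Google's AI-training opt-out; honoring any of them is voluntary):

    # Advisory only: nothing enforces this
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Opts out of Google AI training; Search crawling is unaffected
    User-agent: Google-Extended
    Disallow: /

    User-agent: *
    Allow: /

A bot that ignores this hits no technical barrier at all, which is exactly the problem.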

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

arjie No.45069623
The dream is real, man. If you want open content on the Internet, there's never been a better time. My blog is open to all, machine or man, and it's hosted on my home server next to me. I don't see why anyone would bother trying to distinguish humans from AI. A human hitting your website too much is no different from an AI hitting your website too much.

I have a robots.txt that tries to help bots not get stuck in loops, but if they want to, they're welcome to. Let the web be open. Slurp up my stuff if you want to.
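
For the curious, the loop-avoidance part is just a few Disallow rules. A sketch with hypothetical paths (yours depend on your URL structure):

    User-agent: *
    # Calendar pages link to next-month forever
    Disallow: /calendar/
    # Faceted search URLs explode combinatorially
    Disallow: /search?
    # Nonstandard, but some bots honor it: seconds between requests
    Crawl-delay: 10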

Amazonbot seems to love visiting my site, and it is always welcome.

danudey No.45069787
> I don't see why anyone would bother trying to distinguish humans from AI.

Because a hundred thousand people reading a blog post is more beneficial to the world than an AI scraper bot fetching my (unchanged) blog post a hundred thousand times just in case it's changed in the last hour.

If AI bots were well-behaved (maintained a consistent user agent, used consistent IP subnets, and respected robots.txt), I wouldn't have a problem with them. You could manage your content filtering however you wanted, or not at all, and that would be that. Unfortunately, at the moment AI bots do everything they can to bypass whatever restrictions, blocks, or rate limits you put on them. They behave as though they're entitled to overload your servers in their quest to train models so their operators can make billions on the new AI craze, while giving nothing back to the people whose content they're misappropriating.
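
For contrast, a genuinely well-behaved fetcher is trivial to write. A minimal Python sketch (the user agent and URLs are hypothetical) that checks robots.txt and sends conditional headers, so re-checking an unchanged page costs the server a 304 and no body at all:

    import time
    from urllib import robotparser
    from urllib.parse import urlsplit

    import requests

    USER_AGENT = "ExampleBlogBot/1.0 (+https://example.org/bot)"  # hypothetical

    def polite_fetch(url, etag=None, last_modified=None):
        """Fetch url only if robots.txt allows it, revalidating via ETag/Last-Modified."""
        # Honor robots.txt for our declared (consistent!) user agent
        parts = urlsplit(url)
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        if not rp.can_fetch(USER_AGENT, url):
            return None

        # Conditional headers: the server replies 304 with no body if unchanged
        headers = {"User-Agent": USER_AGENT}
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified

        resp = requests.get(url, headers=headers, timeout=30)
        time.sleep(10)  # crude politeness: at most one request per 10 seconds
        if resp.status_code == 304:
            return None  # unchanged since last fetch; nothing re-downloaded
        resp.raise_for_status()
        return resp

That's the entire ask: identify yourself, check robots.txt, revalidate instead of re-downloading.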

IAmGraydon No.45075610
>Because a hundred thousand people reading a blog post is more beneficial to the world than an AI scraper bot fetching my (unchanged) blog post a hundred thousand times just in case it's changed in the last hour.

You have zero evidence of this actually happening (because it's not happening).