
454 points positiveblue | 1 comments
TIPSIO ◴[] No.45066555[source]
Everyone loves the dream of a free for all and open web.

But the reality is: how can someone small protect their blog or content from AI training bots? E.g., do they just blindly trust that whoever is crawling is honestly labeling Agent vs. Training bots and super duper respecting robots.txt? Get real...

Or, fine, what if they do respect robots.txt, but they buy data that may or may not have been shielded through liability layers as "licensed data"?

Unless you're Reddit, X, Google, or Meta with scary, unlimited-budget legal teams, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

gausswho ◴[] No.45066945[source]
What we need is some legal teeth behind robots.txt. It won't stop everyone, but Big Corp would be a tasty target for lawsuits.
qwerty456127 ◴[] No.45067718[source]
What we need is to stop fighting robots and start welcoming and helping them. I see zero reasons to oppose robots visiting any website I would build. The only thing I ever tried disallowing robots for was preventing search engines from indexing incomplete versions or going the paths which really make no sense for them to go. Now I think we should write separate instructions for different kinds of robots: a search engine indexer shouldn't open pages which have serious side effects (e.g. placing an order) or display semi-realtime technical details, but an LLM agent may be on a legitimate mission involving these.
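As a rough sketch of what "separate instructions for different kinds of robots" could look like with today's robots.txt, the snippet below gives a search indexer and an LLM agent different rule groups and checks them with Python's standard `urllib.robotparser`. The agent names (`ExampleSearchBot`, `ExampleLLMAgent`) and the paths are hypothetical, made up for illustration. robots.txt is only advisory, so this describes what a well-behaved bot would do, not something the site can enforce.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The agent names and paths are illustrative
# assumptions, not real crawlers or a recommended policy.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow: /checkout/
Disallow: /status/

User-agent: ExampleLLMAgent
Disallow: /drafts/

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The indexer is kept away from pages with side effects...
print(parser.can_fetch("ExampleSearchBot", "/checkout/order"))
# ...while the agent is allowed there but kept out of unfinished drafts.
print(parser.can_fetch("ExampleLLMAgent", "/checkout/order"))
print(parser.can_fetch("ExampleLLMAgent", "/drafts/post"))
# Any bot without its own group falls through to the catch-all rule.
print(parser.can_fetch("SomeOtherBot", "/blog/post"))
```

The per-agent groups only help against bots that identify themselves honestly and choose to comply.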
Symbiote ◴[] No.45068339[source]
> I see zero reasons to oppose robots visiting any website I would build.

> preventing search engines from indexing incomplete versions or going the paths which really make no sense for them to go.

What will you do when the bots ignore your instructions, and send a million requests a day to these URLs from half a million different IP addresses?

1. ianbutler ◴[] No.45068643[source]
Let my site go down and then restart my server a few hours later. I'm a dude with a blog; I'm not making uptime guarantees. I think you're overestimating the harm and how often this happens.

Misbehaving scrapers have been a problem for years, not just from AI. I've written posts on how to properly handle scraping, the legal grey area it puts you in, and how to be a responsible scraper. If companies don't want to be responsible, the solution isn't to abdicate an open web. It's to make better law and enforce it.