
454 points | positiveblue | 2 comments
TIPSIO ◴[] No.45066555[source]
Everyone loves the dream of a web that's free for all and open.

But the reality is: how can someone small protect their blog or content from AI training bots? E.g. are they supposed to just blindly trust that whoever hits their site is sending an agent rather than a training bot, and is super duper respecting robots.txt? Get real...
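
And "respecting robots.txt" is a pure honor system anyway: a polite crawler has to opt in to a check roughly like this minimal Python sketch (the user-agent token and URLs here are made up), and the site owner can never verify that it happened.

    import urllib.robotparser

    # A well-behaved crawler volunteers to do this; nothing enforces it.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # "ExampleTrainingBot" is a made-up user-agent token.
    page = "https://example.com/blog/some-post"
    if rp.can_fetch("ExampleTrainingBot", page):
        print("robots.txt allows fetching", page)
    else:
        print("robots.txt disallows fetching", page)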

Or, fine, say they do respect robots.txt, but then they just buy the data, which may or may not have been shielded through liability layers as "licensed data"?

Unless you're Reddit, X, Google, or Meta, with their scary unlimited-budget legal teams, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

replies(37): >>45066600 #>>45066626 #>>45066827 #>>45066906 #>>45066945 #>>45066976 #>>45066979 #>>45067024 #>>45067058 #>>45067180 #>>45067399 #>>45067434 #>>45067570 #>>45067621 #>>45067750 #>>45067890 #>>45067955 #>>45068022 #>>45068044 #>>45068075 #>>45068077 #>>45068166 #>>45068329 #>>45068436 #>>45068551 #>>45068588 #>>45069623 #>>45070279 #>>45070690 #>>45071600 #>>45071816 #>>45075075 #>>45075398 #>>45077464 #>>45077583 #>>45080415 #>>45101938 #
gausswho ◴[] No.45066945[source]
What we need is some legal teeth behind robots.txt. It won't stop everyone, but Big Corp would be a tasty target for lawsuits.
replies(8): >>45067035 #>>45067135 #>>45067195 #>>45067518 #>>45067718 #>>45067723 #>>45068361 #>>45068809 #
quectophoton ◴[] No.45067195[source]
I don't know about this. It would mean I'd get sued for using a feed reader on Codeberg[1], or for mirroring repositories from there (e.g. with Forgejo), since both are automated actions not caused directly by a user interaction (i.e. bots rather than user agents).
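
To be concrete about the kind of automation I mean, here's roughly everything a personal feed reader does (the feed URL and user-agent string are made up); there's no human interaction anywhere in the loop:

    import urllib.request

    # Runs from cron every hour; nobody clicks anything, so under a broad
    # "no automated access" rule this personal feed reader is already a bot.
    req = urllib.request.Request(
        "https://codeberg.org/some-user/some-project/releases.rss",  # made-up feed URL
        headers={"User-Agent": "my-feed-reader/0.1 (personal use)"},  # made-up UA string
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        feed_xml = resp.read()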

[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....

replies(3): >>45067379 #>>45067381 #>>45068696 #
1. gausswho ◴[] No.45067381{3}[source]
To be more specific, if we assume good faith on the part of our fine congresspeople to craft this well... ok yeah, for the sake of the hypothetical I'll continue...

The legal teeth I would advocate for would target crawlers (a subset of bots) and would not cover your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. That would let server-side tools reject them efficiently, along the lines of the sketch below. Failure to comply would result in fines large enough to change behavior.
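
On the server side, the enforcement hook could then be as cheap as a single header check, something like this Python sketch (the crawler tokens are only illustrative; no such registry exists today):

    # Minimal WSGI sketch: reject any request whose User-Agent carries a
    # registered crawler token. Only useful if crawlers must self-identify.
    REGISTERED_CRAWLERS = {"GPTBot", "CCBot", "ExampleSearchBot"}

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token in ua for token in REGISTERED_CRAWLERS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Registered crawlers are not welcome here.\n"]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Hello, human (or at least an unregistered user agent).\n"]

Of course, today nothing compels an AI crawler to send an honest User-Agent in the first place, which is exactly why the fines would need to do the real work.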

Now that I write that out: if such a thing were to come to pass and was well received, I do worry that Congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.

replies(1): >>45067725 #
2. quectophoton ◴[] No.45067725[source]
Yeah, my main worry here is how we define the unwanted traffic, and how that definition could be twisted by bigcorp lawyers.

If it's too loose, something like "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", then it's an argument that can be used against ad blockers, or in favor of locking access to very specific devices like you mention. It might even give slightly more teeth to currently unenforceable TOS.

If it's too strict, it's probably easier to find loopholes and technicalities that just let them say "technically, it doesn't match the definition of unwanted traffic".

Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.

I know this is a mini-rant rather than a helpful comment that tries to come up with a solution; it's just that I'm pessimistic, because it seems the internet becomes a bit worse day by day no matter what we try to do :c