454 points positiveblue | 27 comments
TIPSIO ◴[] No.45066555[source]
Everyone loves the dream of a free-for-all, open web.

But the reality is: how can someone small protect their blog or content from AI training bots? E.g., they just blindly trust that whoever is crawling is sending agent traffic rather than training bots and super duper respecting robots.txt? Get real...

Or, fine, what if they do respect robots.txt, but then buy data that may or may not have been shielded through liability layers as "licensed data"?

Unless you're Reddit, X, Google, or Meta with scary, unlimited-budget legal teams, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

replies(37): >>45066600 #>>45066626 #>>45066827 #>>45066906 #>>45066945 #>>45066976 #>>45066979 #>>45067024 #>>45067058 #>>45067180 #>>45067399 #>>45067434 #>>45067570 #>>45067621 #>>45067750 #>>45067890 #>>45067955 #>>45068022 #>>45068044 #>>45068075 #>>45068077 #>>45068166 #>>45068329 #>>45068436 #>>45068551 #>>45068588 #>>45069623 #>>45070279 #>>45070690 #>>45071600 #>>45071816 #>>45075075 #>>45075398 #>>45077464 #>>45077583 #>>45080415 #>>45101938 #
1. gausswho ◴[] No.45066945[source]
What we need is some legal teeth behind robots.txt. It won't stop everyone, but Big Corp would be a tasty target for lawsuits.
replies(8): >>45067035 #>>45067135 #>>45067195 #>>45067518 #>>45067718 #>>45067723 #>>45068361 #>>45068809 #
2. notatoad ◴[] No.45067035[source]
It wouldn’t stop anyone. The bots you want to block already operate out of places where those laws wouldn’t be enforced.
replies(2): >>45067279 #>>45091305 #
3. stronglikedan ◴[] No.45067135[source]
It should have the same protections as an EULA, where the crawler is the end user, and crawlers should be required to read it and apply it.
replies(1): >>45069792 #
4. quectophoton ◴[] No.45067195[source]
I don't know about this. This means I'd get sued for using a feed reader on Codeberg[1], or for mirroring repositories from there (e.g. with Forgejo), since both are automated actions that are not caused directly by a user interaction (i.e. bots, rather than user agents).

[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....

replies(3): >>45067379 #>>45067381 #>>45068696 #
5. qbane ◴[] No.45067279[source]
Then that is a good reason to deny the requests from those IPs
replies(2): >>45067652 #>>45069724 #
6. blibble ◴[] No.45067379[source]
> This means I'd get sued for using a feed reader on Codeberg

you think codeberg would sue you?

replies(1): >>45067519 #
7. gausswho ◴[] No.45067381[source]
To be more specific, if we assume good faith from our fine congresspeople to craft this well... ok yeah, for the sake of the hypothetical I'll continue...

The legal teeth I would advocate would be targeted at crawlers (a subset of bots) and would not cover your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow server-side tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
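
A rough sketch of what that server-side rejection could look like, assuming crawlers were required to identify themselves with something like an X-Registered-Crawler header (the header name, the crawler IDs, and the middleware are hypothetical):

    # Hypothetical WSGI middleware: reject self-identified registered crawlers
    # before the request reaches the application. The X-Registered-Crawler
    # header and the crawler IDs below are made up for illustration.
    DISALLOWED_CRAWLERS = {"bigcorp-search", "bigcorp-ai-training"}

    class RejectRegisteredCrawlers:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            crawler_id = environ.get("HTTP_X_REGISTERED_CRAWLER", "").lower()
            if crawler_id in DISALLOWED_CRAWLERS:
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Crawling not permitted; see /robots.txt\n"]
            return self.app(environ, start_response)

    # usage: application = RejectRegisteredCrawlers(application)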

Now that I write that out, if such a thing were to come to pass, and it was well received, I do worry that congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.

replies(1): >>45067725 #
8. Galanwe ◴[] No.45067518[source]
- Moral rules are never really effective

- Legal threats are never really effective

Effective solutions are:

- Technical

- Monetary

I like the idea of the web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy tokens to consume information, if you're the leecher type, or earn them back through contributions.

It's more or less the same concept as torrents back in the day.

This should be applied to emails too. The regular person sends what, 20 emails per day, max? Say it costs $0.01 per email; anyone could pay that. But if you want to spam 1,000,000 every day, that becomes prohibitive.

replies(1): >>45067753 #
9. quectophoton ◴[] No.45067519{3}[source]
Probably not.

But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.

10. literalAardvark ◴[] No.45067652{3}[source]
I've run a few hundred small domains for various online stores with an older backend that didn't scale very well for crawlers, and at some point we started blocking by continent.

It's getting really, really ugly out there.
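
For reference, the continent-level blocking described above could be sketched roughly like this, assuming MaxMind's geoip2 Python library and a GeoLite2 country database (the block list and database path are placeholders):

    import geoip2.database  # pip install geoip2
    import geoip2.errors

    # Placeholder: use real two-letter continent codes ("AF", "AS", "EU", "NA", "OC", "SA").
    BLOCKED_CONTINENTS = {"XX"}

    reader = geoip2.database.Reader("/path/to/GeoLite2-Country.mmdb")

    def is_blocked(client_ip: str) -> bool:
        """Return True if the client's continent is on the block list."""
        try:
            continent = reader.country(client_ip).continent.code
        except geoip2.errors.AddressNotFoundError:
            return False  # unknown addresses fall through to normal handling
        return continent in BLOCKED_CONTINENTS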

11. qwerty456127 ◴[] No.45067718[source]
What we need is to stop fighting robots and start welcoming and helping them. I see zero reasons to oppose robots visiting any website I would build. The only purpose I ever used robots.txt disallows for was preventing search engines from indexing incomplete versions or following paths that make no sense for them. Now I think we should write separate instructions for different kinds of robots: a search engine indexer shouldn't open pages that have serious side effects (e.g. place an order) or display semi-realtime technical details, but an LLM agent may be on a legitimate mission involving exactly this.
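
For instance, robots.txt already supports per-agent groups, so the split could look roughly like this (Googlebot is a real crawler; "ExampleAgent" and the paths are made up for illustration):

    # Search engine indexers: keep them away from pages with side effects
    # and from noisy technical endpoints.
    User-agent: Googlebot
    Disallow: /cart/
    Disallow: /checkout/
    Disallow: /status/

    # A hypothetical LLM agent acting on a user's behalf might legitimately
    # need the checkout flow, so only the technical pages are off limits here.
    User-agent: ExampleAgent
    Disallow: /status/

    # Everyone else
    User-agent: *
    Disallow: /drafts/
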
replies(2): >>45067851 #>>45068339 #
12. edm0nd ◴[] No.45067723[source]
No we don't.
13. quectophoton ◴[] No.45067725{3}[source]
Yeah, my main worry here is how we define the unwanted traffic, and how that definition could be twisted by bigcorp lawyers.

If it's too loose and similar to "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against adblocks, or in favor of very specific devices like you mention. Might even give slightly more teeth to currently-unenforceable TOS.

If it's too strict, it's probably easier to find loopholes and technicalities that just let them say "technically it doesn't match the definition of unwanted traffic".

Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.

I know this is a mini-rant rather than a helpful comment that tries to come up with a solution; it's just that I'm pessimistic because it seems the internet gets a bit worse day by day no matter what we try to do :c

14. edm0nd ◴[] No.45067753[source]
> This should be applied to emails too. The regular person sends what, 20 emails per day, max? Say it costs $0.01 per email; anyone could pay that.

This seems flawed.

Poor people living in 3rd world countries that make like $2.00/day wouldn't be able to afford this.

> But if you want to spam 1,000,000 every day, that becomes prohibitive.

Companies and people with money can easily pay this with no issues. If it costs $10,000 to send 1M emails that actually reach the inbox but you profit $50k, it's a non-issue.

15. vkou ◴[] No.45067851[source]
> What we need is to stop fighting robots and start welcoming and helping them. I see zero reasons to oppose robots visiting any website I would build.

Well, I'm glad you speak for the entire Internet.

Pack it in folks, we've solved the problem. Tomorrow, I'll give us the solution to wealth inequality (just stop fighting efforts to redistribute wealth and political power away from billionaires hoarding it), and next week, we'll finally get to resolve the old question of software patents.

16. Symbiote ◴[] No.45068339[source]
> I see zero reasons to oppose robots visiting any website I would build.

> preventing search engines from indexing incomplete versions or following paths that make no sense for them.

What will you do when the bots ignore your instructions, and send a million requests a day to these URLs from half a million different IP addresses?

replies(2): >>45068643 #>>45069784 #
17. ctoth ◴[] No.45068361[source]
The funny thing about the good old WWW is the first two W's stand for world-wide.

So

Which legal teeth?

replies(1): >>45091317 #
18. ianbutler ◴[] No.45068643{3}[source]
Let my site go down; I'll restart the server a few hours later. I'm a dude with a blog; I'm not making uptime guarantees. I think you're overestimating the harm and how often this happens.

Misbehaving scrapers have been a problem for years, not just since AI. I've written posts on how to handle scraping properly, the legal grey area it puts you in, and how to be a responsible scraper. If companies don't want to be responsible, the solution isn't to abandon an open web. It's to make better law and enforce it.

19. lucb1e ◴[] No.45068696[source]
You don't get sued for using a service as it is meant to be used (using an RSS reader on their feed endpoint; cloning repositories it is their mission to host). For one, it doesn't anger anyone, so they wouldn't bother trying to enforce a rule; and secondly, it would be a fruitless case, because the judge would say the claim isn't reasonable.

Robots.txt is meant for crawlers, not user agents such as a feed reader or git client.

replies(1): >>45069605 #
20. jopsen ◴[] No.45068809[source]
I have the feeling that it's the small players that cause problems.

Dumb bots that don't respect robots.txt or nofollow are the ones trying every combination of the filters available in your search options and requesting all pages for each such combination.

The number of search pages can easily be exponential in the number of filters you offer: with 20 independent on/off filters, that's already 2^20, about a million distinct URLs.

Bots walking into these traps do it because they are dumb. But even a small degenerate bot can send more requests than 1M MAUs.

At least that's my impression of the problem we're sometimes facing.

Signed agents seem like a horrific solution, and maybe just serving the traffic is better.

21. quectophoton ◴[] No.45069605{3}[source]
I agree with you; generally you can expect good faith to be returned with good faith (but here I want to heavily emphasize that I only agree on the judge part iff good faith can be assumed and the judge is informed enough to actually make an informed decision).

But not everyone thinks that's the purpose of robots.txt. For example, quoting Wikipedia[1] (emphasis mine):

> indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

Quoting the linked `web robots` page[2]:

> An Internet bot, web robot, robot, or simply bot, is a software application that runs automated tasks (scripts) on the Internet, usually with the intent to imitate human activity, such as messaging, on a large scale. [...] The most extensive use of bots is for web crawling, [...]

("usually" implying that's not always the case; "most extensive use" implying it's not the only use.)

Also a quick HN search for "automated robots.txt"[3] shows that a few people disagree that it's only for crawlers. It seems to be only a minority, but the search results are obviously biased towards HN users, so it could be different outside HN.

Besides all this, there's also the question of whether web scraping (not crawling) should also be subject to robots.txt or not; where "web scraping" includes any project like "this site has useful info but it's so unusable that I made a script so I can search it from my terminal, and I cache the results locally to avoid unnecessary requests".
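
That kind of personal scraping script can be as small as a fetch wrapper with a local cache; a minimal sketch (the cache location and User-Agent string are made up):

    import hashlib
    import pathlib
    import requests  # pip install requests

    CACHE_DIR = pathlib.Path("~/.cache/example-scraper").expanduser()
    CACHE_DIR.mkdir(parents=True, exist_ok=True)

    def fetch_cached(url: str) -> str:
        """Fetch a page, reusing a locally cached copy to avoid unnecessary requests."""
        key = hashlib.sha256(url.encode()).hexdigest()
        path = CACHE_DIR / key
        if path.exists():
            return path.read_text()
        resp = requests.get(url, headers={"User-Agent": "example-personal-scraper"}, timeout=30)
        resp.raise_for_status()
        path.write_text(resp.text)
        return resp.text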

The behavior of alternative viewers like Nitter could also be considered web scraping if they don't get their info from an API[4], and I don't know if I'd consider Nitter the bad actor here.

But yeah, like I said I agree with your comment and your interpretation, but it's not the only interpretation of what robots.txt is meant for.

[1]: https://en.wikipedia.org/wiki/Robots.txt

[2]: https://en.wikipedia.org/wiki/Internet_bot

[3]: https://hn.algolia.com/?dateRange=all&query=automated%20robo...

[4]: I don't know how Nitter actually works or where it gets its data from; I just mention it so it's easier to explain what I mean by "alternative viewer".

22. ◴[] No.45069724{3}[source]
23. immibis ◴[] No.45069784{3}[source]
Sue them / press charges. DDoS is a felony.
24. immibis ◴[] No.45069792[source]
So none at all? EULAs are mostly just meant to intimidate you so you won't exercise your inalienable rights.
replies(1): >>45077206 #
25. majorchord ◴[] No.45077206{3}[source]
I find that extremely hard to believe. Do you have a source?
26. account42 ◴[] No.45091305[source]
If that was the case, then why am I getting buttflare-blocked here in the EU?
27. account42 ◴[] No.45091317[source]
Try hosting some illegal-in-the-US content and find out.