TIPSIO ◴[] No.45066555[source]
Everyone loves the dream of a free-for-all and open web.

But the reality is: how can someone small protect their blog or content from AI training bots? E.g., are they supposed to just blindly trust that whoever is scraping is sending agent traffic rather than training bots and super duper respecting robots.txt? Get real...

Or fine, say they do respect robots.txt, but then they buy data that may or may not have been shielded through liability layers as "licensed data"?

Unless you're Reddit, X, Google, or Meta, with scary unlimited-budget legal teams, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

wvenable ◴[] No.45067955[source]
> Everyone loves the dream of a free-for-all and open web... But the reality is: how can someone small protect their blog or content from AI training bots?

Aren't these statements entirely in conflict? You either have a free-for-all, open web or you don't. Blocking AI training bots is not free and open for all.

BrenBarn ◴[] No.45068929[source]
No, that is not true. It is only true if you just equate "AI training bots" with "people" on some kind of nominal basis without considering how they operate in practice.

It is like saying "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?" Well, the reason is because rhinoceroses are simply not going to stroll up and down the aisles and head to the checkout line quietly with a box of cereal and a few bananas. They're going to knock over displays and maybe even shelves and they're going to damage goods and generally make the grocery store unusable for everyone else. You can say "Well, then your problem isn't rhinoceroses, it's entities that damage the store and impede others from using it" and I will say "Yes, and rhinoceroses are in that group, so they are banned".

It's certainly possible to imagine a world where AI bots use websites in more acceptable ways --- in fact, it's more or less the world we had prior to about 2022, where scrapers did exist but were generally manageable with widely available techniques. But that isn't the world that we live in today. It's also certainly true that many humans are using websites in evil ways (notably including the humans who are controlling many of these bots), and it's also very true that those humans should be held accountable for their actions. But that doesn't mean that blocking bots makes the internet somehow unfree.

This type of thinking that freedom means no restrictions makes sense only in a sort of logical dreamworld disconnected from practical reality. It's similar to the idea that "freedom" in the socioeconomic sphere means the unrestricted right to do whatever you please with resources you control. Well, no, that is just your freedom. But freedom globally construed requires everyone to have autonomy and be able to do things, not just those people with lots of resources.

RobSm ◴[] No.45072168[source]
You can always stop bots: add a login and password. But people want their content to be accessible to as large an audience as possible, while at the same time not wanting that data to reach the same audience via other channels. Logic. Bots are not consuming your data - humans are. At the end of the day humans will read it and take action. For example, ChatGPT will mention your site and the user will visit it.

And no, nothing was different before 2022. Just look at Google, the largest bot-scraping network in the world, running since 1996.

replies(1): >>45072689 #
lucumo ◴[] No.45072689[source]
> And no, nothing was different before 2022. Just look at Google, the largest bot-scraping network in the world, running since 1996.

I'm sorry, but this statement shows you have no recent experience with these crawlernets.

Google, from the beginning, has done their best to work with server owners. They respect robots.txt. I think they were the first to implement Crawl-Delay. They crawl based on how often things actually change. And as an additional safeguard, when they notice your responses slowing down, they back off.
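
That backoff isn't exotic, by the way. A toy sketch of the idea (Python; the thresholds, cap, and URL list are made up for illustration) is about this much:

    import time
    import urllib.request

    BASE_DELAY = 5.0      # seconds between requests when the server looks healthy
    SLOW_THRESHOLD = 2.0  # a response slower than this suggests the server is straining
    MAX_DELAY = 300.0

    def polite_fetch(urls):
        delay = BASE_DELAY
        for url in urls:
            start = time.monotonic()
            with urllib.request.urlopen(url, timeout=30) as resp:
                body = resp.read()
            elapsed = time.monotonic() - start
            if elapsed > SLOW_THRESHOLD:
                delay = min(delay * 2, MAX_DELAY)   # server struggling: slow down
            else:
                delay = max(delay / 2, BASE_DELAY)  # server fine: relax toward the base rate
            yield url, body
            time.sleep(delay)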

Compare this with Anthropic. On their website they say they follow robots.txt and Crawl-Delay. I have an explicit ban on Claudebot in there and a Crawl-Delay for everyone else. It ignores both. I sent them an email about this, and their answer didn't address the discrepancy between the docs and the behaviour. They just said they'd add me to their internal whitelist and that I should've sent 429s when they were going too fast. (Fuck off, how about you follow your own public documentation?)
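
For reference, the relevant part of my robots.txt boils down to the following (the paths and delay value are simplified placeholders here), and honoring it takes nothing more than a stock parser:

    from urllib import robotparser

    # Roughly what the relevant part of my robots.txt says (simplified):
    ROBOTS_TXT = """\
    User-agent: ClaudeBot
    Disallow: /

    User-agent: *
    Crawl-delay: 10
    """

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    print(rp.can_fetch("ClaudeBot", "https://example.com/some/post"))  # False: banned outright
    print(rp.crawl_delay("SomeOtherBot"))                              # 10: seconds to wait between requests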

That's just my experience, but if you Google around you'll find that Anthropic is notorious for ignoring robots.txt.

And still, Claudebot is one of the better-behaved bots. At least they identify themselves, have a support email they respond to, and use identifiable IP addresses.

A few weeks ago I spent four days figuring out why I had 20x the traffic I normally have (which maxed out the server, causing user complaints). It turns out there are parties that crawl using millions of (residential) IPs while identifying themselves as normal browsers. Only 1 or 2 connections per IP at a time. Randomization of identifying properties. Even Anthropic's 429 solution wouldn't have worked there.

I managed to find a minor identifying property in some of the requests that didn't catch too many real users. I used that to start firewalling IPs on sight, and in the end their own randomization caused every IP to fall into the trap. But it took days.

In the end I had to firewall nearly 3 million non-consecutive IP addresses.
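
For the curious, the mechanics were roughly this (a simplified sketch; the real identifying pattern stays private, and the regex, log path, and nftables set name below are placeholders):

    import re
    import subprocess

    # Assumes an nftables set created up front, e.g.
    #   nft add table inet filter
    #   nft add set inet filter badbots '{ type ipv4_addr; }'
    # plus a rule that drops any source address in @badbots.
    FINGERPRINT = re.compile(r"some-odd-combination-of-request-properties")  # placeholder
    CLIENT_IP = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}) ")

    def ban(ip):
        subprocess.run(["nft", "add", "element", "inet", "filter", "badbots",
                        "{", ip, "}"], check=False)

    def scan(log_path="/var/log/nginx/access.log"):
        banned = set()
        with open(log_path) as log:
            for line in log:
                m = CLIENT_IP.match(line)
                if m and m.group(1) not in banned and FINGERPRINT.search(line):
                    ban(m.group(1))
                    banned.add(m.group(1))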

So no, Google in 1996 or 2006 or 2016 is not the same as the modern DDoSing crawlernet.

replies(2): >>45075397 #>>45091036 #
ijk ◴[] No.45075397[source]
I am still a bit confused by what some of these crawlers are getting out of it; repeatedly crawling sites that haven't changed seems to be the norm for the current crawlernets, which is a massive waste of resources on their end for what is, on average, data of rather indifferent quality.
replies(1): >>45075409 #
immibis ◴[] No.45075409[source]
Nothing. They're not designed to be useful. They're designed to grab as much data as possible and they'll figure out what to do with it later - they don't know it's mostly useless yet.

Tarpits are cool.
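
A minimal sketch of the idea, for anyone who hasn't seen one (port and page contents are arbitrary): accept a connection, then dribble out an endless "page" so a rude crawler ties up its own connection for as long as it's willing to wait.

    import socket
    import time

    # One connection at a time, just to show the idea; a real tarpit would
    # handle many connections concurrently.
    def tarpit(port=8081):
        srv = socket.create_server(("", port))
        while True:
            conn, _ = srv.accept()
            try:
                conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
                while True:
                    conn.sendall(b"<a href='/%d'>more</a>\n" % int(time.time()))
                    time.sleep(10)
            except OSError:
                pass
            finally:
                conn.close()

    if __name__ == "__main__":
        tarpit()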