454 points positiveblue | 9 comments
TIPSIO No.45066555
Everyone loves the dream of a free for all and open web.

But the reality is how can someone small protect their blog or content from AI training bots? E.g., are they supposed to blindly trust that a crawler honestly identifies itself as an agent rather than a training bot, and that it super-duper respects robots.txt? Get real...

Or, fine, what if they do respect robots.txt, but then buy data that may or may not have been shielded through liability layers as "licensed data"?

Unless you're Reddit, X, Google, or Meta, with scary, unlimited-budget legal teams, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

wvenable No.45067955
> Everyone loves the dream of a free for all and open web... But the reality is how can someone small protect their blog or content from AI training bots?

Aren't these statements entirely in conflict? You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.

BrenBarn No.45067998
I think that was the point. Everyone loves the dream, but the reality is different.
wilson090 No.45068015
How so? If you don't want AI bots reading information on the web, you don't actually want a free and open web. The reality of an open web is that such information is free and available for anyone.
epc No.45068621{4}
Do the AI training bots provide free access to the distillation of the content they drain from my site repeatedly? Don't they want a free and open web?

I don’t feel a particular need to subsidize multi-billion or even trillion-dollar corporations with my content, bandwidth, and server costs, since their genius vibe-coded bots apparently don’t know how to use conditional (If-Modified-Since) GETs or caching, let alone parse and respect robots.txt.
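
For anyone wondering what "parse and respect robots.txt" and conditional GETs would involve, here is a minimal sketch using only Python's standard library. The robots.txt content, the bot name `ExampleTrainingBot`, and the helper names `may_fetch` and `conditional_headers` are all made up for illustration; this is not how any particular crawler works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a small site (illustrative only).
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def conditional_headers(etag=None, last_modified=None) -> dict:
    """Build request headers for a conditional GET from cached validators.

    A server that supports them can answer 304 Not Modified with no body,
    so an unchanged page costs the site almost no bandwidth.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


if __name__ == "__main__":
    url = "https://example.com/post"
    print(may_fetch(ROBOTS_TXT, "ExampleTrainingBot", url))
    print(conditional_headers(last_modified="Wed, 01 Jan 2025 00:00:00 GMT"))
```

A crawler would store the `ETag` and `Last-Modified` values from its previous response and send them back on the next visit; both checks are cheap, which is the point of the complaint.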

wvenable No.45068806{5}
Is the problem that they exist, or that they are accessing your site badly? Because two issues are being conflated here. If humans or robots are causing you issues, as both can do, that's bad. But that has nothing to do with AI in particular.
1. epc No.45068860{6}
Problem one is they do not honor the conventions of the web and abuse the sites. Problem two is they are taking content for free, distilling it into a product, and limiting access to that product.
2. wvenable No.45068895
Problem one is not specific to AI and not even about AI.

Problem two is not anything new. Taking freely available content and distilling it into a product is something valuable and potentially worth paying for. People used to buy encyclopedias too. There are countless examples.

3. Symbiote No.45069987
At present, problem one is almost entirely AI companies.
4. wvenable No.45070052{3}
And a few decades ago, it would have been search engine scrapers instead.
5. immibis No.45071482{3}
There's actually not much evidence of this, since the attack traffic is anonymous.
6. not2b No.45071491{4}
And that problem was largely solved by robots.txt. AI scrapers are ignoring robots.txt and beating the hell out of sites. Small sites that have decades' worth of quality information are suffering the most. Many of the scrapers are taking extreme measures to avoid being blocked, like using large numbers of distinct IP addresses (perhaps using botnets).
7. Symbiote No.45073198{4}
HN people working at these AI companies have commented to say they do this, and the timing correlates with the rise of AI companies/funding.

I haven't tried to find it in my own logs, but others have said blocking an identifiable AI bot soon led to the same pattern of requests continuing through a botnet.

8. lostmsu No.45074219{5}
Did HN people present evidence?
9. amiga386 No.45077326
Problem one _is_ about AI.

It was a similar problem with cryptocurrencies. Out comes some kind of tech thingy, and a million get-rich-quick scammers pop out of the woodwork and start scamming left, right and center. Suddenly everyone's in on the hustle, everyone's cryptomining, or taking over computers and using them for cryptomining, they're setting the world on fire with electricity consumption through the roof just to fight against other people (who they wouldn't need to fight against if they'd just cooperate).

A vision. A gold rush. A massive increase in shitty human behaviour motivated by greed.

And now here we are again with AI. Massive interest. Trillions of dollars being sloshed around, everyone hustling to develop something so they'll get picked and flooded with cash. An enormous pile of deeply unethical and disrespectful behaviour by people who are doing what they're doing because that's where the money is. The AI bubble.