
454 points | positiveblue
TIPSIO ◴[] No.45066555[source]
Everyone loves the dream of a free-for-all, open web.

But the reality is: how can someone small protect their blog or content from AI training bots? E.g., are they supposed to just blindly trust that whoever is crawling is sending an agent rather than a training bot, and super duper respecting robots.txt? Get real...
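(For reference, the opt-out mechanism itself is trivial. Several training crawlers do publish robots.txt tokens, e.g. GPTBot, CCBot, and Google-Extended, so a minimal opt-out sketch looks like the block below. But honoring it is entirely voluntary, which is exactly the problem.)

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /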

Or, fine, what if they do respect robots.txt, but then buy data that may or may not have been laundered through liability-shielding layers as "licensed data"?

Unless you're Reddit, X, Google, or Meta, with legal teams on scary unlimited budgets, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

replies(37): >>45066600 #>>45066626 #>>45066827 #>>45066906 #>>45066945 #>>45066976 #>>45066979 #>>45067024 #>>45067058 #>>45067180 #>>45067399 #>>45067434 #>>45067570 #>>45067621 #>>45067750 #>>45067890 #>>45067955 #>>45068022 #>>45068044 #>>45068075 #>>45068077 #>>45068166 #>>45068329 #>>45068436 #>>45068551 #>>45068588 #>>45069623 #>>45070279 #>>45070690 #>>45071600 #>>45071816 #>>45075075 #>>45075398 #>>45077464 #>>45077583 #>>45080415 #>>45101938 #
davepeck ◴[] No.45068588[source]
> Everyone loves the dream of a free-for-all, open web. But the reality is: how can someone small protect their blog or content from AI training bots?

I'm old enough to remember when people asked the same questions about HotBot, Lycos, AltaVista, Ask Jeeves, and -- eventually -- Google.

Then, as now, it never felt like the right way to frame the question. If you want your content freely available, make it freely available... including to the bots. If you want your content restricted, make it restricted... including to the humans.

It's also not clear to me that AI materially changes the equation; Google has for many years been cutting links to small sites out of results anyway, in favor of instant answers.

(FWIW, the big companies typically do honor robots.txt. It's everyone else that does what they please.)

replies(2): >>45068873 #>>45070106 #
Symbiote ◴[] No.45070106[source]
Google (and the others) crawl from published IP ranges, with "Google" in the user agent. They read robots.txt. They are very easy to block.
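
As a sketch of what "easy" means here: Google documents a reverse-DNS check for verifying Googlebot, so something like the following (illustrative Python, not production code) separates the real crawler from impostors:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        # Step 1: reverse DNS. Per Google's verification docs, real
        # Googlebot IPs resolve to hosts under googlebot.com or google.com.
        try:
            host, _, _ = socket.gethostbyaddr(ip)
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward-confirm. The hostname must resolve back to the
        # same IP, otherwise the PTR record could simply be forged.
        try:
            _, _, addrs = socket.gethostbyname_ex(host)
        except OSError:
            return False
        return ip in addrs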

The AI scum companies crawl from infected botnet IPs, with user agents identical to the latest Chrome or Safari.

replies(2): >>45070475 #>>45074321 #
davepeck ◴[] No.45070475[source]
Okay. Which, specifically, are the "AI scum" companies you're speaking of?

There are plenty of non-AI companies that also use dubiously sourced IPs and hide behind fake User-Agents.

replies(1): >>45073210 #
Symbiote ◴[] No.45073210[source]
I don't know which companies, of course. They hide their identity by using a botnet.

This traffic is new; it appeared around the same time as the wave of AI startups.

I see traffic from new search engines and other crawlers, but it generally respects robots.txt and identifies itself, or else comes from a small pool of IP addresses.
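
That pool is easy to see in the logs. As an illustration (the log path is hypothetical, and the standard "combined" log format is assumed), counting distinct IPs per user agent makes the difference obvious: a self-identifying crawler is one UA on a handful of IPs, while the botnet traffic is a stock Chrome UA spread across thousands:

    import re
    from collections import defaultdict

    # Assumes the nginx/Apache "combined" log format; access.log is an
    # illustrative path.
    LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    ips_per_ua = defaultdict(set)
    with open("access.log") as f:
        for line in f:
            m = LINE.match(line)
            if m:
                ip, ua = m.groups()
                ips_per_ua[ua].add(ip)

    # A browser user agent fanned out across thousands of source IPs is
    # the distributed-crawler signature described above.
    for ua, ips in sorted(ips_per_ua.items(), key=lambda kv: -len(kv[1]))[:10]:
        print(f"{len(ips):>7} distinct IPs  {ua[:80]}")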