454 points positiveblue | 17 comments
TIPSIO ◴[] No.45066555[source]
Everyone loves the dream of a free for all and open web.

But the reality is how can someone small protect their blog or content from AI training bots? E.g.: They just blindly trust someone is sending Agent vs Training bots and super duper respecting robots.txt? Get real...

Or, fine, what if they do respect robots.txt, but they buy the data that may or may not have been shielded through liability layers via "licensed data"?

Unless you're Reddit, X, Google, or Meta with scary, unlimited-budget legal teams, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

replies(37): >>45066600 #>>45066626 #>>45066827 #>>45066906 #>>45066945 #>>45066976 #>>45066979 #>>45067024 #>>45067058 #>>45067180 #>>45067399 #>>45067434 #>>45067570 #>>45067621 #>>45067750 #>>45067890 #>>45067955 #>>45068022 #>>45068044 #>>45068075 #>>45068077 #>>45068166 #>>45068329 #>>45068436 #>>45068551 #>>45068588 #>>45069623 #>>45070279 #>>45070690 #>>45071600 #>>45071816 #>>45075075 #>>45075398 #>>45077464 #>>45077583 #>>45080415 #>>45101938 #
wvenable ◴[] No.45067955[source]
> Everyone loves the dream of a free for all and open web... But the reality is how can someone small protect their blog or content from AI training bots?

Aren't these statements entirely in conflict? You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.

replies(8): >>45067998 #>>45068139 #>>45068376 #>>45068589 #>>45068929 #>>45069170 #>>45073712 #>>45074969 #
BrenBarn ◴[] No.45067998[source]
I think that was the point. Everyone loves the dream, but the reality is different.
replies(1): >>45068015 #
wilson090 ◴[] No.45068015[source]
How so? If you don't want AI bots reading information on the web, you don't actually want a free and open web. The reality of an open web is that such information is free and available for anyone.
replies(6): >>45068058 #>>45068155 #>>45068305 #>>45068547 #>>45068621 #>>45068828 #
1. gradstudent ◴[] No.45068058[source]
How is it available for everyone if the AI bots bring down your server?
replies(5): >>45068142 #>>45068202 #>>45068241 #>>45068453 #>>45068709 #
2. sebasvisser ◴[] No.45068142[source]
Build better
3. mikestorrent ◴[] No.45068202[source]
Ultimately, you have to realize that this is a losing battle, unless we have completely draconian control over every piece of silicon. Captchas are being defeated; at this point they're basically just mechanisms to prove you Really Want to Make That Request to the extent that you'll spend some compute time on it, which is starting to become a bit of a waste of electricity and carbon.

Talented people that want to scrape or bot things are going to find ways to make that look human. If that comes in the form of tricking a physical iPhone by automatically driving the screen physically, so be it; many such cases already!

The techniques you need for preventing DDoS don't need to really differentiate that much between bots and people unless you're being distinctly targeted; Fail2Ban-style IP bans are still quite effective, and basic WAF functionality does a lot.
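
For illustration, a Fail2Ban-style ban can be sketched in a few lines. This is only a minimal in-memory sketch, not how Fail2Ban itself works (Fail2Ban watches log files and updates firewall rules); the class name `IpBanList` and the thresholds are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from collections import defaultdict, deque


class IpBanList:
    """Hypothetical sketch: ban an IP that exceeds max_hits requests
    within `window` seconds, for `ban_time` seconds."""

    def __init__(self, max_hits=100, window=10.0, ban_time=600.0):
        self.max_hits = max_hits
        self.window = window
        self.ban_time = ban_time
        self._hits = defaultdict(deque)   # ip -> recent request timestamps
        self._banned_until = {}           # ip -> time the ban expires

    def allow(self, ip, now):
        """Return True if the request from `ip` at time `now` may proceed."""
        until = self._banned_until.get(ip)
        if until is not None:
            if now < until:
                return False              # still banned
            del self._banned_until[ip]    # ban expired

        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window:
            hits.popleft()

        if len(hits) > self.max_hits:
            # Too many requests in the window: ban and reset the counter.
            self._banned_until[ip] = now + self.ban_time
            hits.clear()
            return False
        return True
```

Note that nothing here tries to tell bots from people; it only looks at request volume per address, which is the point being made above.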

replies(1): >>45091177 #
4. edoceo ◴[] No.45068241[source]
Everyone can get it from the bots now?
5. ForHackernews ◴[] No.45068453[source]
Rate-limits? Use a CDN? Lots of traffic can be a problem whether it's bots or humans.
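
As a rough sketch of what a rate limit means here, a token bucket is one common approach. The class name `TokenBucket` and the numbers are hypothetical; a real deployment would more likely use the rate limiting built into the web server, proxy, or CDN, and would keep one bucket per client.

```python
class TokenBucket:
    """Hypothetical sketch: allow bursts of up to `capacity` requests,
    refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.last = None               # time of the previous request

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        if self.last is not None:
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The limiter treats a traffic spike the same way whether it comes from bots or humans.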
replies(1): >>45069881 #
6. wvenable ◴[] No.45068709[source]
Is that really the problem we are discussing? I've had people attack my server and bring it down. But that has nothing to do with being free and open to everyone. A top Hacker News post could take my server down.
replies(1): >>45069858 #
7. danudey ◴[] No.45069858[source]
Yes, because a top Hacker News post takes your server down because a large number of actual humans are looking to gain actual value from your posts. Meanwhile, you stand to benefit from the HN discussion by learning new things and perspectives from the community.

The AI bot assault, on the other hand, is one company (or a few companies) re-fetching the same data over and over again, constantly, in perpetuity, just in case it's changed, all so they can incorporate it into their training set and make money off of it while giving you zero credit and providing zero feedback.

replies(1): >>45070023 #
8. danudey ◴[] No.45069881[source]
You realize this entire thread is about a pitch from a CDN company trying to solve an issue that has presented itself at such a scale that this is the best option they can think of to keep the web alive, right?

"Use a CDN" is not sufficient when these bots are so incredibly poorly behaved, because you're still paying for that CDN and this bad behavior is going to cost you a fortune in CDN costs (or cost the CDN a fortune instead, which is why Cloudflare is suggesting this).

9. wvenable ◴[] No.45070023{3}[source]
But then we get to use those AI tools.

The refrain here comes down not to "AI" but mostly to "the AI bot assault", which is a different thing. Sure, let's have a discussion about badly behaved and overzealous web scrapers. As for credit, I've asked AI for its references and gotten them. If my information is merely mushed into an AI training model, I'm not sure why I need credit. If you discuss this thread with your friends, are you going to give me credit?

replies(2): >>45072211 #>>45072463 #
10. frm88 ◴[] No.45072211{4}[source]
"If you discuss this thread with your friends are you going to give me credit?"

Yes. How else would I enable my friends to look it up for themselves?

replies(1): >>45077584 #
11. tsimionescu ◴[] No.45072463{4}[source]
No, you don't "get to" use the AI tools. You have to buy access to them (beyond some free trials).
replies(1): >>45077577 #
12. wvenable ◴[] No.45077577{5}[source]
Yes. I get to buy access to them. They're providing an expensive-to-provide service that requires specialized expertise. I don't see the problem with that.
13. wvenable ◴[] No.45077584{5}[source]
Six months from now, when you've internalized this entire thread, are you even going to remember where you got it from?
replies(1): >>45080671 #
14. frm88 ◴[] No.45080671{6}[source]
Why are you shifting the discussion by adding two new variables (time/memory)?
replies(1): >>45087055 #
15. wvenable ◴[] No.45087055{7}[source]
Because that's how one interacts with AI.
replies(2): >>45087955 #>>45090441 #
16. frm88 ◴[] No.45090441{8}[source]
Yeah. Running out of arguments, are you?
17. account42 ◴[] No.45091177[source]
Agreed, copyright issues need to be solved via legislation and network abuse issues need to be solved by network operators. Trying to run around either only makes the web worse for everyone.