
454 points by positiveblue | 3 comments

TIPSIO:
Everyone loves the dream of a free for all and open web.

But the reality is: how can someone small protect their blog or content from AI training bots? Are they supposed to blindly trust that a crawler honestly declares itself an agent rather than a training bot, and dutifully respects robots.txt? Get real...

Or, fine: what if the bots do respect robots.txt, but the companies behind them buy the same data, shielded through liability layers as "licensed data"?

Unless you're reddit, X, Google, or Meta with scary unlimited budget legal teams, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

wvenable:
> Everyone loves the dream of a free for all and open web... But the reality is how can someone small protect their blog or content from AI training bots?

Aren't these statements entirely in conflict? Either you have a free-for-all open web or you don't. Blocking AI training bots is not free and open for all.

alemanek:
And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.

Why is outsourcing this to Cloudflare bad and doing it yourself ok? Am I allowed to buy a license to a rate limiter or do I need to code my own? Am I allowed to use a firewall or is blocking people from probing my server not free enough?

Why are bots, or any other user, entitled to unlimited visits to my website? The entitlement is kind of unreal at this point.
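Rate limiting of the kind being argued over here is conceptually simple. A minimal token-bucket sketch (all names and parameters are illustrative, not any particular product's API):

```python
import time


class TokenBucket:
    """Minimal per-client token-bucket rate limiter (illustrative sketch).

    Each client gets `capacity` tokens; one token is spent per request,
    and tokens refill at `rate` per second. Requests with no token
    available are rejected.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# In a real server you would keep one bucket per client key (e.g. IP).
buckets: dict[str, TokenBucket] = {}

def allow_request(client_ip: str) -> bool:
    bucket = buckets.setdefault(client_ip, TokenBucket(rate=1.0, capacity=5.0))
    return bucket.allow()
```

Doing this yourself versus paying Cloudflare is exactly the trade-off the comment describes: the mechanism is easy, but applying it at scale against adversarial traffic is the hard part.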

diggan:
> And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.

Where are people getting this from? No, Cloudflare, or any other CDN, is not required for you to host your own stuff. Sure, it's easy, and probably the best way to go if you just wanna focus on shipping, but let's not pretend it's a requirement today.

> Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point

I don't think they are, that's why we have rate limiters, right? :) I think the point is that if you're allowing a user to access some content in one way, why not allow that same user to access the content in the same way, but using a different user-agent? That's the original purpose of that header after all, to signal what the user used as an agent on their behalf. Commonly, I use Firefox as my agent for browsing, but I should be free to use any user-agent, if we want the web to remain open and free.
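The point about the header can be made concrete. A User-Agent is just a string the client chooses to send; here is a sketch using Python's standard library (the URL and the bot's UA string are made up for illustration):

```python
import urllib.request

def make_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build the same GET request, differing only in the declared agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# The "same user, different agent" scenario from the comment:
browser_req = make_request(
    "https://example.com/post",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
)
agent_req = make_request(
    "https://example.com/post",
    "MyReaderBot/1.0 (+https://example.com/bot-info)",  # hypothetical bot UA
)
```

Both requests fetch identical content; the server only sees a different self-declared label, which is why filtering on User-Agent alone is both easy to do and easy to evade.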

alemanek:
My point is that people choosing to outsource the complexity of running a rate limiter and blocking bad actors to Cloudflare and others like them is not the issue you make it out to be.

Why is it good for me to do it myself but bad to pay Cloudflare $20 a month to do it for me? No one is forcing me to use their services. I still have the option to do it myself, or use someone else, or not use anything at all. Seems pretty free to me.

Many AI scraping bots are notoriously bad actors and are hammering sites. Please don't pretend they are all, or even mostly, well behaved. We didn't see this pushback against search-engine scraping bots because those were mostly well behaved.

You are setting up a straw man with "hey, why not let this hypothetical well-behaved bot in?" That isn't the argument, or the reality. We never needed to block Google's, Yahoo's, or Bing's bots because they respected robots.txt and visited at a reasonable frequency.
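For reference, "respecting robots.txt" means checking it before every fetch, and Python's standard library ships a parser for exactly this. A minimal sketch (the rules and URLs below are illustrative; GPTBot is OpenAI's published crawler token, used here only as an example):

```python
from urllib import robotparser

# An example robots.txt: block one named crawler entirely,
# ask everyone else to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

rp.can_fetch("GPTBot", "https://example.com/post")       # → False
rp.can_fetch("SomeOtherBot", "https://example.com/post") # → True
rp.crawl_delay("SomeOtherBot")                           # → 10
```

A well-behaved crawler runs this check and honors the result; the complaint upthread is precisely that nothing forces a scraper to do so.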