"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement; otherwise there would be a stricter protocol around it.
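To put it concretely: the only "enforcement" robots.txt ever gets is the scraper choosing to check it. A rough sketch with Python's stdlib parser (the URLs and user agent are just placeholders):

    import urllib.robotparser

    # The *client* decides whether to consult robots.txt at all;
    # nothing on the server side forces this check to happen.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder URL
    rp.read()

    # A polite scraper asks first and honors the answer.
    if rp.can_fetch("ExampleBot/1.0", "https://example.com/some/page"):
        print("robots.txt permits fetching this URL")
    else:
        print("robots.txt asks us not to fetch this URL")

An impolite scraper simply never runs that check, and the server is none the wiser.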
It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low-stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise, if enforcing a no-scraping rule is of utmost importance, you need to require an API token or some other form of authentication before you serve the pages.
When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.
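A rough sketch of what "change the logic of the server" could look like, assuming a small Flask app and a made-up token list (none of this is anyone's actual setup):

    from flask import Flask, abort, request

    app = Flask(__name__)

    # Hypothetical token store; a real deployment would use real credentials.
    VALID_TOKENS = {"example-token-123"}

    @app.route("/articles/<path:page>")
    def articles(page):
        # No valid token -> 401, never a 2xx, so no "consent" is granted.
        token = request.headers.get("Authorization", "").removeprefix("Bearer ")
        if token not in VALID_TOKENS:
            abort(401)
        return f"contents of {page}"  # 200 only for authenticated clients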
"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"
Yeah, well…
As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so it doesn't "return a response with a 200-series status code" when they don't agree to this scraping (see the sketch below).
As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.
As if technical details were good carriers of human intent.
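To make the first point concrete, here is roughly the kind of "server logic" an admin might reach for, with made-up numbers: a per-IP rate limiter. Key the counter on the client address, and a pool of millions of residential IPs slips under the threshold on every single one of them:

    import time
    from collections import defaultdict

    WINDOW = 60        # seconds (arbitrary)
    MAX_REQUESTS = 30  # per IP per window (arbitrary)

    hits = defaultdict(list)  # ip -> timestamps of recent requests

    def allow(ip: str) -> bool:
        """Naive per-IP limit: fine against one noisy client, useless
        when every request arrives from a different residential IP."""
        now = time.time()
        hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
        if len(hits[ip]) >= MAX_REQUESTS:
            return False  # answer 429 instead of 200
        hits[ip].append(now)
        return True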
I agree with this criticism of the analogy; I actually had this flaw in mind from the start. There are other flaws I have in mind as well.
I developed the point further, without the analogy, in the rest of the comment. How about we focus on the crux of the matter?
> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way
The point is that these scrapers use tricks that make it difficult not to grant them access. What is unreasonable here is to think that 200 means consent, especially knowing about the tricks.
Edit:
> you're more than welcome to put an authentication gate around your content.
I don't want to. Adding auth just so LLM providers don't abuse my servers and the work I meant to share publicly is not a workable solution.
As the web server operator, you can try to figure out if there's a human behind the IP, and you might be right or wrong. You can try to figure out if it's a web browser, or if it's someone typing in curl from a command line, or if it's a massively parallel automated system, and you might be right or wrong. You can try to guess what country the IP is in, and you might be right or wrong. But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.
I disagree. If your mental model doesn't allow conceptualizing (abusive) scrapers, it is too simplistic to be useful for understanding and dealing with reality.
But I'd like to re-state the frame / the concern: it's not about any bot or any scraper, it is about the despicable behavior of LLM providers and their awful scrapers.
I'm personally fine with bots accessing my web servers, there are many legitimate use cases for this.
> But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.
It is not about denying access to the content to some and allowing access to others.
It is about having to deal with abuses.
Is a world in which people stop sharing their work publicly because of these abuses desirable? Hell no.
Legally, in the US, a “public” web server can have any set of usage restrictions it feels like, even without a login screen. Private property doesn’t automatically give permission to do whatever you want on it, even if there happens to be a driveway from the public road into the middle of it.
The law cares about authorized access, not the specific technical implementation of access. That has caused serious legal trouble for many people who made the seemingly reasonable assumption that, say, access to someURL/A12.jpg also gives them permission to fetch someURL/A13.jpg, etc.
In the real world, these requests are being made, and servers are generating responses. So the way to change that is to change the logic of the servers.
Except that’s not the end of the story.
If you’re running a scraper, you’re risking serious legal consequences the moment you piss off someone running a server badly enough, and at that point it suddenly matters a great deal, independent of what was going on up to then. Having already made those requests, you’ve lost control of the situation.
That’s the real world we’re all living in: you can hope the guy running a server is going to play ball, but that’s simply not under your control. Which is the real reason large, established companies care about robots.txt etc.