"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
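To make the "polite request" concrete, here is a minimal robots.txt sketch (directives per the Robots Exclusion Protocol; the crawler name and path are illustrative). Nothing in the protocol enforces it; compliance is entirely up to the crawler:

```text
# robots.txt — advisory only; a crawler can simply ignore this file
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
```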
It's kind of like leaving a note for the delivery driver saying please don't leave packages on the porch. That's fine for low-stakes situations, but if package security is of utmost importance to you, you should require a signature or pick the package up at the delivery center. Likewise, if enforcing a no-scraping rule is of utmost importance, you need to require an API token or some other form of authentication before you serve the pages.
When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.
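A minimal sketch of what "change the logic of the server" could mean, assuming tokens issued out of band; the names (`choose_status`, `VALID_TOKENS`) are hypothetical, not a real API:

```python
# Sketch: decide the HTTP status before serving content, so that a 2xx
# is only ever returned to requests the operator actually consents to.
VALID_TOKENS = {"secret-token-1"}  # assumption: issued out of band

def choose_status(headers: dict) -> int:
    """Return 200 only for authenticated requests, else 401."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer ") and auth.removeprefix("Bearer ") in VALID_TOKENS:
        return 200  # consent granted explicitly
    return 401      # no 2xx, hence no implied consent

print(choose_status({"Authorization": "Bearer secret-token-1"}))  # 200
print(choose_status({}))                                          # 401
```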
"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"
Yeah, well…
As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so it doesn't "return a response with a 200-series status code" when they don't agree to this scraping.
As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.
As if technical details were good carriers of human intents.
I agree with this criticism of the analogy; I actually had this flaw in mind from the start, and there are other flaws I have in mind as well.
I developed the point further without the analogy in the rest of my comment. How about we focus on the crux of the matter?
> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way
The point is that these scrapers use tricks that make it difficult not to grant them access. What is unreasonable here is to think that 200 means consent, especially knowing about those tricks.
Edit:
> you're more than welcome to put an authentication gate around your content.
I don't want to. Adding auth so LLM providers don't abuse my servers and the work I meant to share publicly is not a working solution.
It is inherently a cat-and-mouse game that you CHOOSE to play. Either implement throttling for clients that consume too many resources on your server, or require auth / a captcha / JavaScript / whatever whenever a client uses too many resources. If the client still chooses to go through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.
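The throttling this commenter describes is often done with a per-client token bucket. A naive sketch (the `Throttle` class and its parameters are illustrative, and per-IP buckets are exactly what rotating residential IPs defeat, as the parent comment notes):

```python
import time
from collections import defaultdict

class Throttle:
    """Naive per-client token bucket: each client accrues `rate` tokens
    per second up to `burst`; a request spends one token or is refused."""

    def __init__(self, rate: float = 5.0, burst: float = 10.0):
        self.rate, self.burst = rate, burst
        # client -> (tokens remaining, timestamp of last check)
        self.state = defaultdict(lambda: (burst, time.monotonic()))

    def allow(self, client_ip: str) -> bool:
        tokens, last = self.state[client_ip]
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[client_ip] = (tokens - 1.0, now)
            return True
        self.state[client_ip] = (tokens, now)
        return False
```

A server would call `allow()` before handling each request and return 429 when it comes back `False`.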
Well, I shouldn't have to do extra work or make things worse for everybody because the LLM bros decided to screw us.
> It is inherently a cat and mouse game that you CHOOSE to play
No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly to the world but never asked for it to be abused.