"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
At the same time, an http GET request is a polite request to respond with the expects content. There is no binding agreement that my webserver sends you the webpage you asked for. I am at liberty to enforce my no-scraping rules however I see fit. I get to choose whether I'm prepared to accept the consequences of a "real user" tripping my web scraping detection thresholds and getting firewalled or served nonsense or zipbombed (or whatever countermeasure I choose). Perhaps that'll drive away a reader (or customer) who opens 50 tabs to my site all at once, perhaps Google will send a badly behaved bot and miss indexing some of my pages or even deindexing my site. For my personal site I'm 100% OK with those consequences. For work's website I still use countermeasures but set the thresholds significantly more conservatively. For production webapps I use different but still strict thresholds and different countermeasures.
Anybody who doesn't consider typical AI company's webscraping behaviour over the last few years to qualify as "abuse" has probably never been responsible for a website with any volume of vaguely interesting text or any reasonable number of backlinks from popular/respected sites.