←back to thread

255 points ColinWright | 2 comments | | HN request time: 0.616s | source
Show context
bakql ◴[] No.45775259[source]
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #
Calavar ◴[] No.45775392[source]
I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.

It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

replies(9): >>45775489 #>>45775674 #>>45776143 #>>45776484 #>>45776561 #>>45776927 #>>45777831 #>>45778192 #>>45779259 #
whimsicalism ◴[] No.45775674[source]
There's an evolving morality around the internet that is very, very different from the pseudo-libertarian rule of the jungle I was raised with. Interesting to see things change.
replies(3): >>45775900 #>>45777031 #>>45783664 #
1. hdgvhicv ◴[] No.45777031[source]
Based on the comments here the polite world of the internet where people obeyed unwritten best practices is certainly over in favour of “grab what you can might makes right”
replies(1): >>45777710 #
2. whimsicalism ◴[] No.45777710[source]
that was never the internet. the old internet was “information wants to be free, good luck if you want to restrict my access or resharing”