←back to thread

255 points ColinWright | 1 comments | | HN request time: 0.215s | source
Show context
bakql ◴[] No.45775259[source]
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #
Calavar ◴[] No.45775392[source]
I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.

It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

replies(9): >>45775489 #>>45775674 #>>45776143 #>>45776484 #>>45776561 #>>45776927 #>>45777831 #>>45778192 #>>45779259 #
hsbauauvhabzb ◴[] No.45775489[source]
How else do you tell the bot you do not wish to be scraped? Your analogy is lacking - you didn’t order a package, you never wanted a package, and the postman is taking something, not leaving it, and you’ve explicitly left a sign saying ‘you are not welcome here’.
replies(5): >>45775544 #>>45775575 #>>45775693 #>>45775841 #>>45775924 #
nkrisc ◴[] No.45775841[source]
Put your content behind authentication if you don’t want it to be requested by just anyone.
replies(1): >>45776712 #
kelnos ◴[] No.45776712[source]
But I do want my content accessible to "just anyone", as long as they are humans. I don't want it accessible to bots.

You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!

replies(4): >>45778293 #>>45778316 #>>45778370 #>>45781869 #
1. 9rx ◴[] No.45778370[source]
> as long as they are humans. I don't want it accessible to bots.

A curious position. There isn't a secondary species using the internet. There is only humans. Unless you foresee some kind of alien invasion or earthworm uprising, nothing other than humans will ever access your content. Rejecting the tools humans use to bridge their biological gaps is rather nonsensical.

> You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!

I suppose it would be pretty neat if humans were born with some kind of internet-like telepathy ability, but lacking that mechanism isn't any kind of real problem. Humans are well adept at using tools and have successfully used tools for millennia. The internet itself is a tool! Which, like before, makes rejecting the human use of tools nonsensical.