
255 points ColinWright | 4 comments
bakql ◴[] No.45775259[source]
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #
Calavar ◴[] No.45775392[source]
I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

robots.txt is a polite request to please not scrape these pages, because doing so probably isn't going to be productive. It was never meant to be a binding agreement; otherwise there would be a stricter protocol around it.
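
To put the "polite request" part in concrete terms, honoring robots.txt amounts to roughly this on the client side (a minimal Python sketch; the user agent and paths are made up). The check is entirely voluntary:

    # robots.txt is advisory: it only matters if the client chooses to
    # consult it. The policy is inlined here for illustration; normally
    # you'd call rp.set_url("https://example.com/robots.txt") and rp.read().
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.parse("""
    User-agent: *
    Disallow: /private/
    """.splitlines())

    print(rp.can_fetch("SomeScraper/1.0", "https://example.com/private/page"))  # False
    # Nothing stops the scraper from issuing the GET anyway; the server
    # never learns whether this check happened at all.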

It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

replies(9): >>45775489 #>>45775674 #>>45776143 #>>45776484 #>>45776561 #>>45776927 #>>45777831 #>>45778192 #>>45779259 #
hsbauauvhabzb ◴[] No.45775489[source]
How else do you tell the bot you do not wish to be scraped? Your analogy is lacking: you didn’t order a package, you never wanted a package, the postman is taking something rather than leaving it, and you’ve explicitly left a sign saying ‘you are not welcome here’.
replies(5): >>45775544 #>>45775575 #>>45775693 #>>45775841 #>>45775924 #
nkrisc ◴[] No.45775841[source]
Put your content behind authentication if you don’t want it to be requested by just anyone.
replies(1): >>45776712 #
kelnos ◴[] No.45776712[source]
But I do want my content accessible to "just anyone", as long as they are humans. I don't want it accessible to bots.

You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!

replies(4): >>45778293 #>>45778316 #>>45778370 #>>45781869 #
1. nkrisc ◴[] No.45781869[source]
Even abusive crawlers and scrapers are acting as agents of real humans, just as your browser is acting as your agent. I don't even know how you could reliably draw a reasonable line in the sand between the two without putting some group of people on the wrong side of the line.

I suppose the ultimate solution would be browser, operating system, and hardware manufacturers co-operating to implement some system that cryptographically signs HTTP requests, attesting that each one was triggered by an actual, physical interaction between a human and a computing device.
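
Very roughly, the signing half of that idea might look like the sketch below (Python with the third-party 'cryptography' package; the header name and the "human attestation" semantics are invented, and the hard part, binding the key to real hardware and a real physical input event, is hand-waved entirely):

    import base64
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # In the imagined scheme this key would live in hardware and only sign
    # when a physical input event actually occurred; here it's just generated
    # in software, which is exactly why the scheme is easy to game.
    device_key = Ed25519PrivateKey.generate()

    def attest_request(method: str, url: str) -> dict:
        message = f"{method} {url}".encode()
        signature = device_key.sign(message)
        return {"X-Human-Attestation": base64.b64encode(signature).decode()}

    headers = attest_request("GET", "https://example.com/page")
    # A cooperating server would verify this against some registry of device
    # public keys it trusts, and maintaining that registry is where the
    # collateral damage starts.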

Though you don't have to think for very long to come up with all kinds of collateral damage such a system would cause, and the ways bad actors could circumvent it anyway.

All in all, this whole issue seems more like a legal problem than a technical one.

replies(1): >>45783535 #
2. bigbuppo ◴[] No.45783535[source]
Or the AI people could just stop being abusive jerks. That's an even easier solution.
replies(2): >>45784066 #>>45784705 #
3. 9rx ◴[] No.45784066[source]
While that is probably good advice in general, the earlier commenter wanted even the abusive jerks to have access to his content.

He just doesn't want tools humans use to access content to be used in association with his content.

What he failed to realize is that if you eliminate the tools, the human cannot access the content anyway. They don't have the proper biological interfaces. Had he realized that, he would have noticed that simply turning off his server fully satisfies the constraints.

4. nkrisc ◴[] No.45784705[source]
That would be easier. Too bad it won't ever happen.