AI scrapers request commented scripts

(cryptography.dog)

255 points ColinWright | 2 comments | 31 Oct 25 15:44 UTC


bakql ◴[31 Oct 25 18:41 UTC] No.45775259[source]▶

>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #

Calavar ◴[31 Oct 25 18:53 UTC] No.45775392[source]▶

>>45775259 #

I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement; otherwise there would be a stricter protocol around it.

It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low-stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise, if enforcing a rule of no scraping is of utmost importance, you need to require an API token or some other form of authentication before you serve the pages.
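To make the "require an API token" option concrete, here is a minimal sketch of the decision a server could make before serving a page. Everything in it is an assumption for illustration: the `API_TOKENS` set, the `status_for_request` helper, and the choice of a `Bearer` scheme in the `Authorization` header are hypothetical and not taken from any site discussed in this thread.

```python
import hmac

# Hypothetical set of valid tokens; a real deployment would load these
# from a secret store rather than hard-coding them.
API_TOKENS = {"example-token"}


def status_for_request(headers: dict[str, str]) -> int:
    """Return the HTTP status a token-gated server might send.

    401 when no usable bearer token is presented, 403 when the token is
    unknown, and 200 only when the token matches a known one.
    """
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme != "Bearer" or not token:
        return 401
    presented = token.encode("utf-8")
    # compare_digest avoids leaking how much of the token matched via timing.
    if not any(hmac.compare_digest(presented, t.encode("utf-8")) for t in API_TOKENS):
        return 403
    return 200


if __name__ == "__main__":
    print(status_for_request({}))
    print(status_for_request({"Authorization": "Bearer example-token"}))
```

Unlike robots.txt, a check like this is enforced by the server itself: a client that ignores it simply never receives the page.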

replies(9): >>45775489 #>>45775674 #>>45776143 #>>45776484 #>>45776561 #>>45776927 #>>45777831 #>>45778192 #>>45779259 #

bigbuppo ◴[31 Oct 25 20:07 UTC] No.45776143[source]▶

>>45775392 #

Seriously. Did you see what that web server was wearing? I mean, sure it said "don't touch me" and started screaming for help and blocked 99.9% of our IP space, but we got more and they didn't block that so clearly they weren't serious. They were asking for it. It's their fault. They're not really victims.

replies(1): >>45776208 #

jMyles ◴[31 Oct 25 20:14 UTC] No.45776208[source]▶

>>45776143 #

Sexual consent is sacred. This metaphor is in truly bad taste.

When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.

replies(5): >>45776310 #>>45776406 #>>45777226 #>>45778865 #>>45780805 #

LexGray ◴[31 Oct 25 22:08 UTC] No.45777226[source]▶

>>45776208 #

Perhaps bad taste, but bots could also be genuinely, deliberately violating the most private or traumatizing moments a vulnerable person has, in any exploitative way they care to. I am not sure bad taste is enough of an excuse not to discuss the issue, as many people do in fact use the internet for sexual things. If anything, consent should be MORE important because it is easier to document and verify.

A vast hoard of personal information exists, and most of it never had, and never will have, proper consent, knowledge, or protection.

replies(1): >>45782435 #

1. jMyles ◴[01 Nov 25 15:30 UTC] No.45782435[source]▶

>>45777226 #

> the most private or traumatizing moments a vulnerable person has

...and in this hypothetical, this person is serving them via an unauthenticated HTTP server and hoping that clients will respect robots.txt?

replies(1): >>45783403 #

2. bigbuppo ◴[01 Nov 25 17:16 UTC] No.45783403[source]▶

>>45782435 (TP) #

Robots are supposed to behave. It was a solved problem 30 years ago until AI bros unsolved it. Any entity that does not obey robots.txt is by definition a malicious actor.
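For reference, a crawler that does obey robots.txt usually checks each URL against the site's rules before fetching it. A minimal sketch using Python's standard `urllib.robotparser` follows; the rules in `ROBOTS_TXT`, the `ExampleBot` user agent, the `may_fetch` helper, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules allow this user agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a live crawler would instead call
    # set_url() and read() to download the site's actual robots.txt.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleBot", "https://example.com/index.html"))
    print(may_fetch("ExampleBot", "https://example.com/private/diary.html"))
```

The check runs entirely on the client side, so it only has an effect if the crawler's author chooses to call it.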
