"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
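To make the "polite request" concrete, here is a minimal robots.txt sketch (directives per the Robots Exclusion Protocol; the crawler name and path are illustrative). Nothing in the protocol enforces it; compliance is entirely up to the crawler:

```text
# robots.txt — advisory only; a crawler can simply ignore this file
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
```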
It's kind of like leaving a note for the delivery driver saying please don't leave packages on the porch. That's fine for low-stakes situations, but if package security is of utmost importance to you, you should require a signature or pick the package up at the delivery center. Likewise, if enforcing a no-scraping rule is of utmost importance, you need to require an API token or some other form of authentication before you serve the pages.
When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.
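A minimal sketch of what "change the logic of the server" could mean, assuming tokens issued out of band; the names (`choose_status`, `VALID_TOKENS`) are hypothetical, not a real API:

```python
# Sketch: decide the HTTP status before serving content, so that a 2xx
# is only ever returned to requests the operator actually consents to.
VALID_TOKENS = {"secret-token-1"}  # assumption: issued out of band

def choose_status(headers: dict) -> int:
    """Return 200 only for authenticated requests, else 401."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer ") and auth.removeprefix("Bearer ") in VALID_TOKENS:
        return 200  # consent granted explicitly
    return 401      # no 2xx, hence no implied consent

print(choose_status({"Authorization": "Bearer secret-token-1"}))  # 200
print(choose_status({}))                                          # 401
```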
"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"
Yeah, well…
As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so it doesn't "return a response with a 200-series status code" when they don't agree to this scraping.
As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.
As if technical details were good carriers of human intents.
I agree with this criticism of the analogy; I actually had this flaw in mind from the start, and there are other flaws I have in mind as well.
I developed the point further without the analogy in the rest of my comment. How about we focus on the crux of the matter?
> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way
The point is that these scrapers use tricks that make it difficult not to grant them access. What is unreasonable here is to think that 200 means consent, especially knowing about those tricks.
Edit:
> you're more than welcome to put an authentication gate around your content.
I don't want to. Adding auth so LLM providers don't abuse my servers and the work I meant to share publicly is not a working solution.
It is inherently a cat-and-mouse game that you CHOOSE to play. Either implement throttling for clients that consume too many resources on your server, or require auth / a captcha / JavaScript / whatever whenever a client uses too many resources. If the client still chooses to go through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.
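The throttling this commenter describes is often done with a per-client token bucket. A naive sketch (the `Throttle` class and its parameters are illustrative, and per-IP buckets are exactly what rotating residential IPs defeat, as the parent comment notes):

```python
import time
from collections import defaultdict

class Throttle:
    """Naive per-client token bucket: each client accrues `rate` tokens
    per second up to `burst`; a request spends one token or is refused."""

    def __init__(self, rate: float = 5.0, burst: float = 10.0):
        self.rate, self.burst = rate, burst
        # client -> (tokens remaining, timestamp of last check)
        self.state = defaultdict(lambda: (burst, time.monotonic()))

    def allow(self, client_ip: str) -> bool:
        tokens, last = self.state[client_ip]
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[client_ip] = (tokens - 1.0, now)
            return True
        self.state[client_ip] = (tokens, now)
        return False
```

A server would call `allow()` before handling each request and return 429 when it comes back `False`.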
Well, I shouldn't have to do extra work or make things worse for everybody because the LLM bros decided to screw us.
> It is inherently a cat and mouse game that you CHOOSE to play
No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly to the world but never asked for it to be abused.