
257 points ColinWright | 15 comments
bakql ◴[] No.45775259[source]
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #
Calavar ◴[] No.45775392[source]
I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
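
To make the "polite request" point concrete, here's a rough sketch (using Python's standard urllib.robotparser; example.com and the crawler name are placeholders) of the check a well-behaved client chooses to run. Nothing in the protocol forces a scraper to run it, which is exactly the point:

    # Sketch: a client voluntarily consulting robots.txt before fetching.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder host
    rp.read()

    # The client itself decides whether to honor the answer.
    if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page.html"):
        print("robots.txt permits fetching this URL")
    else:
        print("robots.txt asks us not to fetch this URL")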

It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low-stakes situations, but if package security is of utmost importance to you, you should arrange to have it sent certified or pick it up at the delivery center. Likewise, if enforcing a no-scraping rule is of utmost importance, you need to require an API token or some other form of authentication before you serve the pages.
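
For illustration, a minimal sketch of what "require an API token before you serve the pages" could look like (Python standard library only; the token value and port are made up, not a real scheme):

    # Sketch: refuse to serve pages unless the request carries a pre-shared token.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    API_TOKEN = "change-me"  # placeholder secret

    class GatedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.headers.get("Authorization") != f"Bearer {API_TOKEN}":
                self.send_response(401)  # no token, no content
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"the gated content\n")

    HTTPServer(("", 8080), GatedHandler).serve_forever()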

replies(9): >>45775489 #>>45775674 #>>45776143 #>>45776484 #>>45776561 #>>45776927 #>>45777831 #>>45778192 #>>45779259 #
bigbuppo ◴[] No.45776143[source]
Seriously. Did you see what that web server was wearing? I mean, sure it said "don't touch me" and started screaming for help and blocked 99.9% of our IP space, but we got more and they didn't block that so clearly they weren't serious. They were asking for it. It's their fault. They're not really victims.
replies(1): >>45776208 #
jMyles ◴[] No.45776208[source]
Sexual consent is sacred. This metaphor is in truly bad taste.

When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.
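
A minimal sketch of what "change the logic of the server" could mean in practice (a toy WSGI app; the user-agent denylist below is just an example, not a recommendation):

    # Sketch: answer non-2xx for clients you do not want to serve.
    DENYLIST = ("GPTBot", "CCBot")  # illustrative user-agent substrings

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot in ua for bot in DENYLIST):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"no consent\n"]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello\n"]

    if __name__ == "__main__":
        from wsgiref.simple_server import make_server
        make_server("", 8000, app).serve_forever()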

replies(5): >>45776310 #>>45776406 #>>45777226 #>>45778865 #>>45780805 #
jraph ◴[] No.45776310[source]
> When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.

"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"

Yeah, well…

As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so that it doesn't "return a response with a 200-series status code" when they don't agree to this scraping.

As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.

As if technical details were good carriers of human intents.

replies(1): >>45776505 #
1. ryandrake ◴[] No.45776505[source]
The locked door is a ridiculous analogy when it comes to the open web. Pretty much all "door" analogies are flawed, but sure let's imagine your web server has a door. If you want to actually lock the door, you're more than welcome to put an authentication gate around your content. A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way.
replies(1): >>45776579 #
2. jraph ◴[] No.45776579[source]
Any analogy is flawed and you can kill most analogies very fast. They are meant to illustrate a point, hopefully efficiently, not to be mathematically true. They are not to everyone's taste, mine included in most cases. They are mostly fine as long as they are used only to illustrate a point, not to make it.

I agree with this criticism of the analogy; I actually had this flaw in mind from the start. There are other flaws I have in mind as well.

I developed the point further, without the analogy, in the rest of that comment. How about we focus on the crux of the matter?

> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way

The point is that these scrapers use tricks that make it difficult not to grant them access. What is unreasonable here is to think that a 200 means consent, especially knowing about those tricks.

Edit:

> you're more than welcome to put an authentication gate around your content.

I don't want to. Adding auth so that LLM providers don't abuse my servers and the work I meant to share publicly is not a working solution.

replies(2): >>45776881 #>>45776885 #
3. jack_pp ◴[] No.45776881[source]
Here's my analogy: it's like you own a museum and you require entrance by "secret" password (your user agent filtering or whatnot). The problem is that the password is the same for everyone, so should you be surprised when someone figures it out, or gets it from a friend, and visits your museum? Either require a fee (processing power, captcha, etc.) or make a private password (auth).

It is inherently a cat and mouse game that you CHOOSE to play. Implement throttling for clients that consume too many resources on your server, or require auth / captcha / JavaScript / whatever whenever a client uses too many resources. If the client still chooses to go through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.
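
For example, a rough per-IP throttling sketch (a token bucket in Python; the rate and burst values are arbitrary, and this is not a drop-in solution):

    # Sketch: per-client token bucket; caller responds 429 when allow() is False.
    import time
    from collections import defaultdict

    RATE = 1.0    # tokens refilled per second
    BURST = 10.0  # bucket capacity

    buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow(client_ip: str) -> bool:
        b = buckets[client_ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # over budget: reply 429 Too Many Requests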

replies(1): >>45777043 #
4. ryandrake ◴[] No.45776885[source]
People need to have a better mental model of what it means to host a public web site, and what they are actually doing when they run the web server and point it at a directory of files. They're not just serving those files to customers. They're not just serving them to members. They're not just serving them to human beings. They're not even necessarily serving files to web browsers. They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET. There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

As the web server operator, you can try to figure out if there's a human behind the IP, and you might be right or wrong. You can try to figure out if it's a web browser, or if it's someone typing in curl from a command line, or if it's a massively parallel automated system, and you might be right or wrong. You can try to guess what country the IP is in, and you might be right or wrong. But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.
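
To make that concrete, here is a rough sketch of how little a "client" needs to be (plain Python sockets; example.com is a placeholder host):

    # Sketch: open a socket, send GET, read the bytes. That's the whole bar.
    import socket

    host = "example.com"
    with socket.create_connection((host, 80)) as s:
        s.sendall(f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode())
        response = b""
        while chunk := s.recv(4096):
            response += chunk

    print(response.decode(errors="replace")[:200])  # status line and headers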

replies(5): >>45776945 #>>45777251 #>>45777676 #>>45778877 #>>45783879 #
5. jraph ◴[] No.45776945{3}[source]
> There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

I disagree. If your mental model doesn't allow conceptualizing (abusive) scrapers, it is too simplistic to be useful for understanding and dealing with reality.

But I'd like to restate the frame / the concern: it's not about any bot or any scraper, it is about the despicable behavior of LLM providers and their awful scrapers.

I'm personally fine with bots accessing my web servers, there are many legitimate use cases for this.

> But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.

It is not about denying access to the content to some and allowing access to others.

It is about having to deal with abuses.

Is a world in which people stop sharing their work publicly because of these abuses desirable? Hell no.

6. jraph ◴[] No.45777043{3}[source]
> Either require a fee (processing power, captcha etc) or make a private password (auth)

Well, I shouldn't have to do extra work, or make things worse for everybody, because the LLM bros decided to screw us.

> It is inherently a cat and mouse game that you CHOOSE to play

No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly with the world, but we never asked for it to be abused.

replies(1): >>45779087 #
7. oytis ◴[] No.45777251{3}[source]
Technically, you are not serving anything - it's just voltage levels going up and down with no meaning at all.
8. Retric ◴[] No.45777676{3}[source]
> They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET.

Legally, in the US, a "public" web server can impose any set of usage restrictions it feels like, even without a login screen. Private property doesn't automatically grant permission to do anything on it just because there happens to be a driveway from the public road into the middle of it.

The law cares about authorized access, not the specific technical implementation of access. That has caused serious legal trouble for people who made the seemingly reasonable assumption that, say, access to someURL/A12.jpg also gives them permission to fetch someURL/A13.jpg, etc.

replies(1): >>45778086 #
9. jMyles ◴[] No.45778086{4}[source]
...but the matter of "what the law cares about" is not really the point of contention here - what matters is what happens in the real world.

In the real world, these requests are being made, and servers are generating responses. So the way to change that is to change the logic of the servers.

replies(1): >>45778591 #
10. Retric ◴[] No.45778591{5}[source]
> In the real world, these requests are being made, and servers are generating responses.

Except that’s not the end of the story.

If you’re running a scraper, you risk serious legal consequences when you piss off someone running a server badly enough, and then it suddenly matters a great deal, independent of what was going on up to that point. Having already made these requests, you’ve lost control of the situation.

That’s the real world we’re all living in: you can hope the guy running a server is going to play ball, but that’s simply not under your control. Which is the real reason large established companies care about robots.txt etc.

11. bigbuppo ◴[] No.45778877{3}[source]
How about AI companies just act ethically and obey norms?
12. jack_pp ◴[] No.45779087{4}[source]
That's like saying you shouldn't have to sanitize your database inputs because you never asked for people to SQL-inject your database. This stance is truly mind-boggling to me.
replies(2): >>45779758 #>>45779855 #
13. jraph ◴[] No.45779758{5}[source]
Would you defend attackers using SQL injections? Because it feels like people here, including you, are defending the LLM scrapers against the sysadmins and authors who dare to share their work publicly.

Ensuring the basic security and robustness of a piece of software is simply not remotely comparable to countering the abuse these LLM companies carry out.

But that's not even the point. And preventing SQL injections (through healthy programming practices) doesn't make things worse for any legitimate user either.

14. catlifeonmars ◴[] No.45779855{5}[source]
It’s both. You should sanitize your inputs because there are bad actors, but you can also categorize SQL injection attempts as abuse, and there is legal recourse.
15. tremon ◴[] No.45783879{3}[source]
The CFAA wants to have a word. The fact that a server responds with a 200 OK has no bearing on the legality of your request; there's plenty of precedent by now.