255 points ColinWright | 24 comments
bakql ◴[] No.45775259[source]
>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

replies(15): >>45775283 #>>45775392 #>>45775754 #>>45775912 #>>45775998 #>>45776008 #>>45776055 #>>45776210 #>>45776222 #>>45776270 #>>45776765 #>>45776932 #>>45777727 #>>45777934 #>>45778166 #
Calavar ◴[] No.45775392[source]
I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."

robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement; otherwise there would be a stricter protocol around it.

It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low-stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise, if enforcing a rule of no scraping is of utmost importance, you need to require an API token or some other form of authentication before you serve the pages.
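
To make the "polite request" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleBot`, and the example.com URLs are all made up for illustration. Note that the check runs entirely on the crawler's side: a crawler that never calls it meets no resistance from the server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("ExampleBot", "https://example.com/index.html"))         # True
print(allowed("ExampleBot", "https://example.com/private/page.html"))  # False
```

A well-behaved crawler would skip the second URL. Nothing in HTTP stops a different crawler from requesting it anyway, which is why authentication is the only hard barrier.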

replies(9): >>45775489 #>>45775674 #>>45776143 #>>45776484 #>>45776561 #>>45776927 #>>45777831 #>>45778192 #>>45779259 #
hsbauauvhabzb ◴[] No.45775489[source]
How else do you tell the bot you do not wish to be scraped? Your analogy is lacking - you didn’t order a package, you never wanted a package, and the postman is taking something, not leaving it, and you’ve explicitly left a sign saying ‘you are not welcome here’.
replies(5): >>45775544 #>>45775575 #>>45775693 #>>45775841 #>>45775924 #
1. Calavar ◴[] No.45775575[source]
If you are serving web pages, you are soliciting GET requests, kind of like ordering a package is soliciting a delivery.

"Taking" versus "giving" is neither here nor there for this discussion. The question is whether you are expressing a preference about etiquette or a hard rule that must be followed. I personally believe robots.txt is the former, and I say that as someone who serves more pages than they scrape.

replies(6): >>45775850 #>>45775994 #>>45776241 #>>45776635 #>>45776878 #>>45778341 #
2. yuliyp ◴[] No.45775850[source]
Having a front door physically allows anyone on the street to come to knock on it. Having a "no soliciting" sign is an instruction clarifying that not everybody is welcome. Having a web site should operate in a similar fashion. The robots.txt is the equivalent of such a sign.
replies(2): >>45775917 #>>45776385 #
3. halJordan ◴[] No.45775917[source]
No soliciting signs are polite requests that no one has to follow, and door-to-door salesmen regularly walk right past them.

No one is calling for the criminalization of door-to-door sales, and no one is worried about how much door-to-door sales increase water consumption.

replies(4): >>45777090 #>>45777176 #>>45779738 #>>45780613 #
4. munk-a ◴[] No.45775994[source]
I disagree strongly here - though not from a technical perspective. There's absolutely a legal concept of making your work available for viewing without making it available for copying, and AI scraping (while we can technically phrase it as just viewing a bunch of times) is effectively copying.

Let's say a large art hosting site realizes how damaging AI training on their data can be - should they respond by adding a paywall before any of their data is visible? If that paywall is added (let's just say $5/mo), can most of the artists currently on their site afford to stay there? Can they afford it if their potential future patrons are limited to just those folks who can pay $5/mo? Would the scraper be able to afford a one-time cost of $5 to scrape all of that data?

I think, as much as they are a deeply flawed concept, this is a case where EULAs, or an assumption of no access for training unless explicitly granted, actually enforced through the legal system, are required. There are a lot of small businesses and side projects that are dying because of these models and I think that creative outlet has societal value we would benefit from preserving.

replies(1): >>45776247 #
5. andoando ◴[] No.45776241[source]
Well yes, this is exactly what's happening as of now. But there SHOULD be a way to upload content without giving scrapers access to it.
6. jMyles ◴[] No.45776247[source]
> There's absolutely a legal concept of making your work available for viewing without making it available for copying

This "legal concept" is enforceable through legacy systems of police and violence. The internet does not recognize it. How much more obvious can this get?

If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?

If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?

replies(1): >>45776668 #
7. czscout ◴[] No.45776385[source]
And a no soliciting sign is no more cosmically binding than robots.txt. It's a request, not an enforceable command.
replies(1): >>45778991 #
8. kelnos ◴[] No.45776635[source]
> If you are serving web pages, you are soliciting GET requests

So what's the solution? How do I host a website that welcomes human visitors, but rejects all scrapers?

There is no mechanism! The best I can do is a cat-and-mouse arms race where I try to detect the traffic I don't want, and block it, while the people generating the traffic keep getting more sophisticated about hiding from my detection.

No, putting up a paywall is not a reasonable response to this.

> The question is are you expressing a preference on etiquette versus a hard rule that must be followed.

Well, there really aren't any hard rules that must be followed, because there are no enforcement mechanisms outside of going nuclear (requiring login). Everything is etiquette. And I agree that robots.txt is also etiquette, and it is super messed up that we tolerate "AI" companies stomping all over that etiquette.

Do we maybe want laws that say everyone must respect robots.txt? Maybe? But then people will just move their scrapers to a jurisdiction without those laws. And I'm sure someone could make the argument that robots.txt doesn't apply to them because they spoofed a browser user-agent (or another user-agent that a site explicitly allows). So perhaps we have a new mechanism, or new laws, or new... something.

But this all just highlights the point I'm making here: there is no reasonable mechanism (no, login pages and http auth don't count) for site owners to restrict access to their site based on these sorts of criteria. And that's a problem.
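
A small sketch of why the detection side of that arms race is so weak, assuming a naive server-side filter keyed on the `User-Agent` header. The blocklist tokens and both header sets are hypothetical, and the function is not taken from any real server or framework; the point is only that the header is whatever the client chooses to send.

```python
# Hypothetical blocklist of crawler name fragments, for illustration only.
BLOCKED_AGENT_TOKENS = ("examplebot", "scraper")


def is_blocked(headers: dict) -> bool:
    """Block a request if its self-declared User-Agent contains a blocked token."""
    agent = headers.get("User-Agent", "").lower()
    return any(token in agent for token in BLOCKED_AGENT_TOKENS)


# The same crawler, once identifying itself honestly and once claiming to be a browser.
honest = {"User-Agent": "ExampleBot/1.0"}
spoofed = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"}

print(is_blocked(honest))   # True
print(is_blocked(spoofed))  # False
```

Only the crawler that tells the truth about itself gets filtered, so the filter mostly penalizes the operators who were already following etiquette.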

9. kelnos ◴[] No.45776668{3}[source]
> legacy systems of police and violence

You use "legacy" as if these systems are obsolete and on their way out. They're not. They're here to stay, and will remain dominant, for better or worse. Calling them "legacy" feels a bit childish, as if you're trying to ignore reality and base arguments on your preferred vision of how things should be.

> The internet does not recognize it.

Sure it does. Not universally, but there are a lot of things governments and law enforcement can do to control what people see and do on the internet.

> If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?

No, of course not, that's silly. That only really works on the margins. Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.

> If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?

Because the governments of all the copyrightistans will block all traffic going in and out of copyleftistan. While this may not stop determined, technically-adept people, it will work for the most part. As I said, this sort of thing only really works on the margins.

replies(1): >>45778080 #
10. davesque ◴[] No.45776878[source]
If I order a package from a company selling a good, am I inviting all that company's competitors to show up at my doorstep to try and outbid the delivery person from the original company when they arrive, and maybe they all show up at the same time and cause my porch to collapse? No, because my front porch is a limited resource that I paid for with an intended purpose in mind. Is it illegal for those other people to show up? Maybe not by the letter of the law.
11. oytis ◴[] No.45777090{3}[source]
> door-to-door salesmen regularly walk right past them.

Oh, now I understand why Americans can't see a problem here.

12. ahtihn ◴[] No.45777176{3}[source]
If a company were sending hundreds of salesmen to knock at a door one after the other, I'm pretty sure it could be successfully sued for harassment.
replies(1): >>45778985 #
13. jMyles ◴[] No.45778080{4}[source]
I guess I'm more optimistic about the future of the human condition.

> You use "legacy" as if these systems are obsolete and on their way out. They're not.

I have serious doubts that nation states will still exist in 500 years. I feel quite certain that they'll be gone in 10,000. And I think it's generally good to build an internet for those time scales.

> base arguments on your preferred vision of how things should be.

I hope we all build toward our moral compass; I don't mean for arguments to fall into fallacies on this basis, but yeah, I think our internet needs to be resilient against the waxing and waning of the affairs of state. I don't know if that's childish... Maybe we need to have a more child-like view of things? The internet _is_ a child in the sense of its maturation timeframe.

> there are a lot of things governments and law enforcement can do to control what people see and do on the internet.

Of course there are things that governments do. But are they effective? I just returned from a throatsinging retreat in Tuva - a fairly remote part of Siberia. The Russian government has apparently quietly begun to censor quite a few resources on the internet, and it has caused difficulty in accessing the traditional music of the Tuvan people. And I was very happily astonished to find that everybody I ran into, including a shaman grandmother, was fairly adept at routing around this censorship using a VPN and/or SSH tunnel.

I think the internet is doing a wonderful job at routing around censorship - better than any innovation ever discovered by humans so far.

> Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.

Again, maybe I'm just more optimistic, but I think that on longer time frames, the sober elder statesmen/women will prevail and realize that violence is not an appropriate response to bytes transiting the wire that they wish weren't.

And at the end of the day, I don't think governments even have the power here - the content creators do. I distribute my music via free channels because that's the easiest way to reach my audience, and because, given the high availability of compelling free content, there's just no way I can make enough money on publishing to even concern myself with silly restrictions.

It seems to me that I'm ahead of the curve in this area, not behind it. But I'm certainly open to being convinced otherwise.

replies(1): >>45780692 #
14. pluto_modadic ◴[] No.45778341[source]
ignoring a rate limit gets you blocked.
replies(1): >>45778997 #
15. hsbauauvhabzb ◴[] No.45778985{4}[source]
Can’t Americans literally shoot each other for trespassing?
replies(1): >>45779013 #
16. hsbauauvhabzb ◴[] No.45778991{3}[source]
Tell me you work in an ethically bankrupt industry without telling me you work in an ethically bankrupt industry.
17. hsbauauvhabzb ◴[] No.45778997[source]
Scrapers actively bypass this by rotating IP addresses.
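
Both halves of this exchange can be seen in a minimal sketch of a per-IP sliding-window limiter. The class name, the limit of 5 requests per 60 seconds, and the addresses (taken from the reserved documentation ranges) are all assumptions for illustration, not a description of any particular site's setup.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds per IP."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def count_allowed(limiter: PerIpRateLimiter, ips: list, now: float = 0.0) -> int:
    """Count how many requests, all arriving at the same instant, get through."""
    return sum(limiter.allow(ip, now) for ip in ips)


# 100 requests from one address versus 100 requests from 100 rotated addresses.
single = count_allowed(PerIpRateLimiter(5, 60.0), ["203.0.113.7"] * 100)
rotated = count_allowed(PerIpRateLimiter(5, 60.0), [f"198.51.100.{i}" for i in range(100)])

print(single, rotated)  # 5 100
```

The limiter does exactly what it promises for a single client, and does nothing once the same load is spread across enough addresses.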
18. dragonwriter ◴[] No.45779013{5}[source]
Generally, legally, no, not just for ignoring a “no soliciting” sign.
replies(1): >>45779406 #
19. hsbauauvhabzb ◴[] No.45779406{6}[source]
But they’re presumably trespassing.
replies(1): >>45782069 #
20. duskdozer ◴[] No.45779738{3}[source]
>No one is calling for the criminalization of door-to-door sales

Ok, I am, right now.

It seems like there are two sides here that are talking past one another: "people will do X and you accept it if you do not actively prevent it, if you can" and "X is bad behavior that should be stopped and shouldn't be the burden of individuals to stop". As someone who leans to the latter, the former just sounds like restating the problem being complained about.

21. distances ◴[] No.45780613{3}[source]
> No one is calling for the criminalization of door-to-door sales

Door-to-door sales absolutely are banned in many jurisdictions.

22. dns_snek ◴[] No.45780692{5}[source]
> Again, maybe I'm just more optimistic, but I think that on longer time frames, the sober elder statesmen/women will prevail and realize that violence is not an appropriate response to bytes transiting the wire that they wish weren't.

Your framing is off because this notion of fairness or morality isn't something they concern themselves with. They're using violence because if they didn't, they would be allowing other entities to gain wealth and power at their expense. I don't think it's much more complex than that.

See how differently these same bytes are treated in the hands of Aaron Swartz vs OpenAI. One threatened to empower humanity at the expense of reducing profits for a few rich men, so he got crucified for it. The other is hoping to make humans redundant, concentrate the distribution of wealth even further, and strengthen US world dominance, so all of the right wheels get greased for them and they get a license to kill - figuratively and literally.

replies(1): >>45782448 #
23. dragonwriter ◴[] No.45782069{7}[source]
And, despite what ideas you may get from the media, mere trespass without imminent threat to life is not a justification for deadly force.

There are some states where the considerations for self-defense do not include a duty to retreat if possible, either in general (“stand your ground” law) or specifically in the home (“castle doctrine”), but all the other requirements for self-defense (imminent threat of certain kinds of serious harm, proportional force) remain part of the law in those states, and trespassing by or while disregarding a “no soliciting” sign would not, by itself, satisfy those requirements.

24. jMyles ◴[] No.45782448{6}[source]
I mean... I agree with everything you've said here. I'm not sure what makes you think I've mis-framed the stakes.