EDIT: I was chastised, here's the original text of my comment: Did you read the article or just the title? They aren't claiming it's abusive. They're saying it's a viable signal to detect and ban bots.
> Last Sunday I discovered some abusive bot behaviour [...]
Maybe that's a way to defend against bots that ignore robots.txt: include a reference to a honeypot HTML file with garbage text, but put the link to it inside an HTML comment.
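A minimal sketch of that trap, assuming a hypothetical honeypot path (/trap/garbage.html) that is disallowed in robots.txt and linked only from inside an HTML comment (e.g. <!-- <a href="/trap/garbage.html"></a> -->), so no polite client or human should ever request it; anything that does can be pulled from the access log and blocklisted:

    # honeypot_ban.py - a sketch, not a drop-in tool.
    # Assumes a common/combined-format access log and a honeypot URL that is
    # linked only inside an HTML comment and disallowed in robots.txt, so no
    # well-behaved client should ever fetch it.
    import re

    HONEYPOT_PATH = "/trap/garbage.html"   # hypothetical path
    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*"')

    def offending_ips(log_path):
        ips = set()
        with open(log_path) as log:
            for line in log:
                m = LOG_LINE.match(line)
                if m and m.group(3).startswith(HONEYPOT_PATH):
                    ips.add(m.group(1))   # this client ignored robots.txt
        return ips

    if __name__ == "__main__":
        for ip in sorted(offending_ips("/var/log/nginx/access.log")):
            print(ip)   # feed these into your firewall / fail2ban / blocklist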
I doubt it.
"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
"Taking" versus "giving" is neither here nor there for this discussion. The question is are you expressing a preference on etiquette versus a hard rule that must be followed. I personally believe robots.txt is the former, and I say that as someone who serves more pages than they scrape
Obligatory: https://stackoverflow.com/questions/1732348/regex-match-open...
For the most part, everybody is participating now, and that brings all of the challenges of any other space with everyone's competing interests colliding - but fewer established systems of governance.
I think there is a really different intent between an action that reads something someone created (which is often a form of marketing) and one that reproduces-but-modifies someone's creative output (which competes against the creator and starves them of income).
The world changed really quickly and our legal systems haven't kept up. It is hurting real people who used to have small side businesses.
No one is calling for the criminalization of door-to-door sales and no one is worried about how much door-to-door sales increases water consumption.
Anybody may watch the demo screen of an arcade game for free, but you have to insert a quarter to play — and you can have even greater access with a key.
> and you’ve explicitly left a sign saying ‘you are not welcome here’
And the sign said "Long-haired freaky people
Need not apply"
So I tucked my hair up under my hat
And I went in to ask him why
He said, "You look like a fine upstandin' young man
I think you'll do"
So I took off my hat and said, "Imagine that
Huh, me workin' for you"
Let's say a large art hosting site realizes how damaging AI training on their data can be - should they respond by adding a paywall before any of their data is visible? If that paywall is added (let's just say $5/mo), can most of the artists currently on their site afford to stay there? Can they afford it if their potential future patrons are limited to just those folks who can pay $5/mo? Would the scraper be able to afford a one-time cost of $5 to scrape all of that data?
I think, as much as they are a deeply flawed concept, this is a case where EULAs, or an assumption of no access for training unless explicitly granted, actually enforced through the legal system, are required. There are a lot of small businesses and side projects that are dying because of these models, and I think that creative outlet has societal value we would benefit from preserving.
We should start suing these bad actors. Why do techies forget that the legal system exists?
However,
(1) there's a difference between (a) a regular user browsing my websites and (b) robots DDoSing them. It was never okay to hammer a webserver. This is not new, and it's for this reason that curl has long had options to throttle repeated requests to servers. In real life, there are many instances of things being offered for free; it's usually not okay to take it all. Yes, this would be abuse. And no, the correct answer to such a situation would not be "but it was free, don't offer it for free if you don't want it to be taken for free". Same thing here.
(2) there's a difference between (a) a regular user reading my website or even copying and redistributing my content as long as the license of this work / the fair use or related laws are respected, and (b) a robot counterfeiting it (yeah, I agree with another commenter, theft is not the right word, let's call a spade a spade)
(3) well-behaved robots are expected to respect robots.txt. This is not the law, this is about being respectful. It is only fair that badly behaved robots get called out (a sketch of what well-behaved looks like follows below).
Well-behaved robots do not usually use millions of residential IPs through shady apps to "perform a GET request to an open HTTP server".
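For contrast, a minimal sketch of what well-behaved looks like, using only Python's standard library: check robots.txt before fetching, identify yourself honestly, and wait between requests to the same host. The identity string and the delay value are arbitrary assumptions, not any standard.

    # polite_fetch.py - sketch of a crawler that checks robots.txt and throttles itself.
    import time
    import urllib.robotparser
    from urllib.parse import urlsplit
    from urllib.request import Request, urlopen

    USER_AGENT = "example-research-bot/0.1"   # hypothetical, but honest, identity
    DELAY_SECONDS = 5                          # arbitrary politeness delay

    def allowed(url):
        parts = urlsplit(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()   # a real crawler would cache the parsed file per host
        return rp.can_fetch(USER_AGENT, url)

    def polite_get(url):
        if not allowed(url):
            return None                        # robots.txt says no: walk away
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req, timeout=30) as resp:
            body = resp.read()
        time.sleep(DELAY_SECONDS)              # never hammer the same host
        return body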
This "legal concept" is enforceable through legacy systems of police and violence. The internet does not recognize it. How much more obvious can this get?
If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?
If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?
People here on HN are laughing at the UK's Online Safety Act for trying to impose restrictions on people in other countries, and yet now you're implying that similar restrictions can be placed on people in other countries, over whom you have neither power nor control.
"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"
Yeah, well…
As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so it doesn't "return a response with a 200-series status code" when they don't agree to this scraping.
As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.
As if technical details were good carriers of human intents.
People who ignore polite requests are assholes, and we are well within our rights to complain about them.
I agree that "theft" is too strong (though I think you might be presenting a straw man there), but "abuse" can be perfectly apt: a crawler hammering a server, requesting the same pages over and over, absolutely is abuse.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
That's a shitty world that we shouldn't have to live in.
I agree with this criticism of this analogy, I actually had this flaw in mind from the start. There are other flaws I have in mind as well.
I have developed more without the analogy in the remaining of the comment. How about we focus on the crux of the matter?
> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way
The point is that these scrapers use tricks so that it's difficult not to grant them access. What is unreasonable here is to think that 200 means consent, especially knowing about the tricks.
Edit:
> you're more than welcome to put an authentication gate around your content.
I don't want to. Adding auth so LLM providers don't abuse my servers and the work I meant to share publicly is not a working solution.
So what's the solution? How do I host a website that welcomes human visitors, but rejects all scrapers?
There is no mechanism! The best I can do is a cat-and-mouse arms race where I try to detect the traffic I don't want, and block it, while the people generating the traffic keep getting more sophisticated about hiding from my detection.
No, putting up a paywall is not a reasonable response to this.
> The question is are you expressing a preference on etiquette versus a hard rule that must be followed.
Well, there really aren't any hard rules that must be followed, because there are no enforcement mechanisms outside of going nuclear (requiring login). Everything is etiquette. And I agree that robots.txt is also etiquette, and it is super messed up that we tolerate "AI" companies stomping all over that etiquette.
Do we maybe want laws that say everyone must respect robots.txt? Maybe? But then people will just move their scrapers to a jurisdiction without those laws. And I'm sure someone could make the argument that robots.txt doesn't apply to them because they spoofed a browser user-agent (or another user-agent that a site explicitly allows). So perhaps we have a new mechanism, or new laws, or new... something.
But this all just highlights the point I'm making here: there is no reasonable mechanism (no, login pages and http auth don't count) for site owners to restrict access to their site based on these sorts of criteria. And that's a problem.
You use "legacy" as if these systems are obsolete and on their way out. They're not. They're here to stay, and will remain dominant, for better or worse. Calling them "legacy" feels a bit childish, as if you're trying to ignore reality and base arguments on your preferred vision of how things should be.
> The internet does not recognize it.
Sure it does. Not universally, but there are a lot of things governments and law enforcement can do to control what people see and do on the internet.
> If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?
No, of course not, that's silly. That only really works on the margins. Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.
> If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?
Because the governments of all the copyrightistans will block all traffic going in and out of copyleftistan. While this may not stop determined, technically-adept people, it will work for the most part. As I said, this sort of thing only really works on the margins.
You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!
Already pretty well established with ad-block, actually. It's a pretty similar case, even. AIs don't click ads, so why should we accept their traffic? If it's disproportionately loading the server without contributing to the funding of the site, it gets blocked.
The server can set whatever rules it wants. If the maintainer hates google and wants to block all chrome users, it can do so.
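A toy sketch of that, assuming a Flask app purely for illustration (a reverse-proxy rule would do the same job); the point is only that the operator gets to decide, not that blocking Chrome is a good idea:

    # ua_block.py - sketch: a server refusing a class of clients it doesn't want.
    # Assumes Flask is installed; matching "Chrome" is an arbitrary example.
    from flask import Flask, request, abort

    app = Flask(__name__)
    BLOCKED_TOKENS = ("Chrome",)   # arbitrary: turn away Chrome users

    @app.before_request
    def refuse_unwanted_clients():
        ua = request.headers.get("User-Agent", "")
        if any(token in ua for token in BLOCKED_TOKENS):
            abort(403)             # the operator's server, the operator's rules

    @app.route("/")
    def index():
        return "Welcome, non-Chrome visitor."

    if __name__ == "__main__":
        app.run()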
> A few of these came from user-agents that were obviously malicious:
(I love the idea that they consider any python or go request to be a malicious scraper...)
1) A coordinated effort among different sites will have a much greater chance of poisoning the data of a model so long as they can avoid any post scraping deduplication or filtering.
2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something which has already been shown to be aggressively litigated against then the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk but they would have to have the means and want to litigate.
It is inherently a cat and mouse game that you CHOOSE to play. Implement throttling for clients that consume too many resources on your server, or require auth / captcha / javascript / whatever whenever a client is using too many resources. If the client still chooses to go through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.
As the web server operator, you can try to figure out if there's a human behind the IP, and you might be right or wrong. You can try to figure out if it's a web browser, or if it's someone typing in curl from a command line, or if it's a massively parallel automated system, and you might be right or wrong. You can try to guess what country the IP is in, and you might be right or wrong. But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.
The solution is the same: implement better security.
If you want to control how someone accesses something, the onus is on you to put access controls in place.
The people who put things on a public, un-restricted server and then complain that the public accessed it in an un-restricted way might be excusable if it's some geocities-esque Mom and Pop site that has no reason to know better, but 'cryptography dog' ain't that.
I disagree. If your mental model doesn't allow conceptualizing (abusive) scrapers, it is too simplistic to be useful for understanding and dealing with reality.
But I'd like to re-state the frame / the concern: it's not about any bot or any scraper, it is about the despicable behavior of LLM providers and their awful scrapers.
I'm personally fine with bots accessing my web servers, there are many legitimate use cases for this.
> But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.
It is not about denying access to the content to some and allowing access to others.
It is about having to deal with abuses.
Is a world in which people stop sharing their work publicly because of these abuses desirable? Hell no.
Well, I shouldn't have to work or make things worse for everybody because the LLM bros decided to screw us.
> It is inherently a cat and mouse game that you CHOOSE to play
No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly to the world but never asked for it to be abused.
A vast hoard of personal information exists and most of it never had or will have proper consent, knowledge, or protection.
I know a thing or two about web scraping.
Some sites serve a 404 status code as protection, so that you skip them, so my crawler tries, like a hammer, several faster crawling methods (curlcffi).
Zip bombs also don't work on me. Reading the Content-Length header is enough to decide not to read a page/file. I set a byte limit to check whether a response is too big for me; for other cases a read timeout is enough.
Oh, and did you know that the requests timeout is not really a timeout for the whole page read? A server can spoonfeed you bytes, one after another, and the timeout will never trigger.
That is why I created my own crawling system to mitigate these problems, and to have one consistent means of running Selenium.
https://github.com/rumca-js/crawler-buddy
Based on library
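The spoonfeeding point deserves emphasis: in Python's requests, for example, timeout bounds each individual socket read, not the whole transfer, so a server dripping one byte at a time never trips it. A hedged sketch of a byte-capped, deadline-bounded download (the limits are arbitrary assumptions, and this is not code from the linked project):

    # bounded_fetch.py - sketch: cap both bytes and wall-clock time for a download.
    # requests' timeout= only bounds individual socket reads, so a server that
    # drips one byte per second can keep a naive client busy indefinitely.
    import time
    import requests

    MAX_BYTES = 5 * 1024 * 1024     # arbitrary cap against zip bombs / huge files
    MAX_SECONDS = 30                # arbitrary wall-clock deadline for the transfer

    def bounded_get(url):
        with requests.get(url, stream=True, timeout=(5, 10)) as resp:
            declared = resp.headers.get("Content-Length")
            if declared is not None and int(declared) > MAX_BYTES:
                return None                         # too big: skip without reading
            chunks, total, start = [], 0, time.monotonic()
            for chunk in resp.iter_content(chunk_size=16_384):
                total += len(chunk)
                if total > MAX_BYTES or time.monotonic() - start > MAX_SECONDS:
                    return None                     # body lied, or spoonfeeding
                chunks.append(chunk)
            return b"".join(chunks)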
But I think it's moot, parsing HTML is not very expensive if you don't have to actually render it.
Fence off your yard if you don't want people coming by and dumping a mountain of garbage on it every day.
You can certainly choose to live in a society that thinks these are acceptable solutions. I think it's bullshit, and we'd all be better off if anyone doing these things would be breaking rocks with their teeth in a re-education camp, until they learn how to be a decent human being.
Legally in the US a “public” web server can have any set of usage restrictions it feels like even without a login screen. Private property doesn’t automatically give permission to do anything even if there happens to be a driveway from the public road into the middle of it.
The law cares about authorized access, not the specific technical implementation of access. Which has caused serious legal trouble for many people when they make the seemingly reasonable assumption that, say, access to someURL/A12.jpg also gives them permission to someURL/A13.jpg, etc.
You don't need my permission to send a GET request, I completely agree. In fact, by having a publicly accessible webserver, there's implied consent that I'm willing to accept reasonable, and valid GET requests.
But I have configured my server to spend server resources the way I want; you don't like how my server works, so you configure your bot to lie. If you get what you want only because you're willing to lie, where's the implied consent?
> robots.txt. This is not the law
In Germany, it is the law. § 44b UrhG says (translated):
(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.
(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.
(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.
This feels like the kind of argument some would make as to why they aren't required to return their shopping cart to the bay.
> robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
Well, no. That's an overly simplistic description which fits your argument, but it doesn't accurately represent reality. Yes, robots.txt was created as a hint for robots, a hint that was never expected to be non-binding, but the important detail, the one that explains why it's called robots.txt, is that the web server exists to serve the requests of humans. Robots are welcome too, but please follow these rules.
You can tell your description is inaccurate and non-representative of the expectations of the web as a whole, because every popular LLM scraper goes out of its way to both follow robots.txt and announce that it does.
> It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch.
It's nothing like that, it's more like a note that says no soliciting, or please knock quietly because the baby is sleeping.
> It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center.
Or, people could not be assholes? Yes, I get it, the reality we live in there are assholes. But the problem as I see it, is not just the assholes, but the people who act as apologists for this clearly deviant behavior.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
Because it's your fault if you don't, right? That's victim blaming. I want to be able to host free, easy to access content for humans, but someone with more money, and more compute resources than I have, gets to overwhelm my server because they don't care... And that's my fault, right?
I guess that's a take...
There's a huge difference between suggesting mitigations for dealing with someone abusing resources, and excusing the abuse of resources, or implying that I should expect my server to be abused, instead of frustrated about the abuse.
Your browser has a bug: if you leave my webpage open in a tab, because of that bug it's going to close the connection, reconnect (new TLS handshake and everything), and re-request that page without any cache tag, every second, every day, for as long as you have the tab open.
That feels kinda problematic, right?
Web servers block well-formed clients all the time, and I agree with you, that's dumb. But servers should be allowed to serve only the traffic they wish. If you want to use some LLM client, but the way that client behaves puts undue strain on my server, what should I do, just accept that your client, and by proxy you, are an asshole?
You shouldn't impose your rules on my webserver, exactly as much as I shouldn't impose my rules on yours. But I believe that, ethically, we should both attempt to respect and follow the rules of the other, blocking traffic when it starts to behave abusively. It's not complex: just try to be nice and help the other as much as you reasonably can.
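A sketch of that "block it once it behaves abusively" stance as a per-client token bucket; the rate and burst numbers are arbitrary assumptions, and in practice this usually lives in the reverse proxy rather than the application:

    # rate_limit.py - sketch of per-client throttling: serve everyone who behaves,
    # refuse clients that hammer the server, whatever software they happen to be.
    import time
    from collections import defaultdict

    RATE = 1.0        # arbitrary: refill one request "token" per second
    BURST = 10.0      # arbitrary: allow short bursts of up to 10 requests

    class Bucket:
        def __init__(self):
            self.tokens = BURST
            self.updated = time.monotonic()

    buckets = defaultdict(Bucket)

    def allow(client_ip):
        b = buckets[client_ip]
        now = time.monotonic()
        b.tokens = min(BURST, b.tokens + (now - b.updated) * RATE)
        b.updated = now
        if b.tokens >= 1.0:
            b.tokens -= 1.0
            return True      # well-behaved: serve the request
        return False         # hammering: reply 429 and move on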
> You use "legacy" as if these systems are obsolete and on their way out. They're not.
I have serious doubts that nation states will still exist in 500 years. I feel quite certain that they'll be gone in 10,000. And I think it's generally good to build an internet for those time scales.
> base arguments on your preferred vision of how things should be.
I hope we all build toward our moral compass; I don't mean for arguments to fall into fallacies on this basis, but yeah, I think our internet needs to be resilient against the waxing and waning of the affairs of state. I don't know if that's childish... Maybe we need to have a more child-like view of things? The internet _is_ a child in the sense of its maturation timeframe.
> there are a lot of things governments and law enforcement can do to control what people see and do on the internet.
Of course there are things that governments do. But are they effective? I just returned from a throatsinging retreat in Tuva - a fairly remote part of Siberia. The Russian government has apparently quietly begun to censor quite a few resources on the internet, and it has caused difficulty in accessing the traditional music of the Tuvan people. And I was very happily astonished to find that everybody I ran into, including a shaman grandmother, was fairly adept at routing around this censorship using a VPN and/or SSH tunnel.
I think the internet is doing a wonderful job at routing around censorship - better than any innovation ever discovered by humans so far.
> Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.
Again, maybe I'm just more optimistic, but I think that on longer time frames, the sober elder statesmen/women will prevail and realize that violence is not an appropriate response to bytes transiting the wire that they wish weren't.
And at the end of the day, I don't think governments even have the power here - the content creators do. I distribute my music via free channels because that's the easiest way to reach my audience, and because, given the high availability of compelling free content, there's just no way I can make enough money on publishing to even concern myself with silly restrictions.
It seems to me that I'm ahead of the curve in this area, not behind it. But I'm certainly open to being convinced otherwise.
In the real world, these requests are being made, and servers are generating responses. So the way to change that is to change the logic of the servers.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
Yes, and if the rule of not dumping a ton of manure on your driveway is so important to you, you should live in a gated community and hire round-the-clock security. Some people do, but living in a society where the only way to not wake up with a ton of manure in your driveway is to spend excessive resources on security is not the world that I would prefer to live in. And I don't see why people would spend time to prove this is the only possible and normal world - it's certainly not the case, we can do better.
50:1 compression ratio, but it's legitimately an implementation of a Rubik's cube that I wasn't actually making as any sort of trap - I just wasn't thinking about file size - so any rule that filters it out is going to have a nasty false positive rate.
This whole thing has made me hate humans, so so much. Robots are much better.
Server owners should have all the right to set the terms of their server access. Better tools to control LLMs and scrapers are all good in my book.
I really wish ad platforms were better at managing malware, trackers and fraud though. It is rather difficult to fully argue for website owner authority with how bad ads actually are for the user.
If you are building a new search engine and the robots.txt only allows Google, are you an asshole for indexing the information?
A curious position. There isn't a secondary species using the internet. There is only humans. Unless you foresee some kind of alien invasion or earthworm uprising, nothing other than humans will ever access your content. Rejecting the tools humans use to bridge their biological gaps is rather nonsensical.
> You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!
I suppose it would be pretty neat if humans were born with some kind of internet-like telepathy ability, but lacking that mechanism isn't any kind of real problem. Humans are well adept at using tools and have successfully used tools for millennia. The internet itself is a tool! Which, like before, makes rejecting the human use of tools nonsensical.
Some antivirus and parental control software will scan links sent to someone from their machine (or from access points/routers).
Even some antivirus services will fetch links from residential IPs in order to detect malware from sites configured to serve malware only to residential IPs.
Actually, I'm not entirely sure how one would tell the difference between a user software scanning links to detect adult content/malware/etc, randos crawling the web searching for personal information/vulnerable sites/etc. and these supposed "AI crawlers" just from access logs.
While I'm certainly not going to dismiss the idea that these are poorly configured crawlers at some major AI company, I haven't seen much in the way of evidence that is the case.
Whatever impact your new search engine or LLM might have in the world is irrelevant to their wishes.
Except that’s not the end of the story.
If you’re running a scraper and risking serious legal consequences when you piss off someone running a server enough, then it suddenly matters a great deal independent of what was going on up to that point. Having already made these requests you’ve just lost control of the situation.
That’s the real world we’re all living in, you can hope the guy running a server is going to play ball but that’s simply not under your control. Which is the real reason large established companies care about robots.txt etc.
If your antivirus software hammers the same website several times a second for hours on end, in a way that is indistinguishable from an "AI crawler", then maybe it's really misbehaving and should be stopped from doing so.
The real money is in monetizing ad responses to AI scrapers so that LLMs are biased toward recommending certain products. The stealth startup I've founded does exactly this. Ad-poisoning-as-a-service is a huge untapped market.
Personally, I'm skeptical of blaming everything on AI scrapers. Everything people are complaining about has been happening for decades - mostly by people searching for website vulnerabilities/sensitive info who don't care if they're misbehaving, sometimes by random individuals who want to archive a site or are playing with a crawler and don't see why they should slow them down.
Even the techniques for poisoning aggressive or impolite crawlers are at least 30 years old.
At the same time, an HTTP GET request is a polite request to respond with the expected content. There is no binding agreement that my webserver sends you the webpage you asked for. I am at liberty to enforce my no-scraping rules however I see fit. I get to choose whether I'm prepared to accept the consequences of a "real user" tripping my web scraping detection thresholds and getting firewalled or served nonsense or zipbombed (or whatever countermeasure I choose). Perhaps that'll drive away a reader (or customer) who opens 50 tabs to my site all at once, perhaps Google will send a badly behaved bot and miss indexing some of my pages or even deindex my site. For my personal site I'm 100% OK with those consequences. For work's website I still use countermeasures but set the thresholds significantly more conservatively. For production webapps I use different but still strict thresholds and different countermeasures.
Anybody who doesn't consider typical AI company's webscraping behaviour over the last few years to qualify as "abuse" has probably never been responsible for a website with any volume of vaguely interesting text or any reasonable number of backlinks from popular/respected sites.
The only thing that seems to have changed is that today's thread is full of people who think they have some sort of human right to access any website by any means possible, including their sloppy vibe-coded crawler. In the past, IIRC, people used to be a little more apologetic about consuming other people's resources and did their best to fly below the radar.
It's my website. I have every right to block anyone at any time for any reason whatsoever. Whether or not your use case is "legitimate" is beside the point.
If scrapers were as well-behaved as humans, website operators wouldn't bother to block them[1]. It's the abuse that motivates the animus and action. As the fine articles spelt out, scrapers are greedy in many ways, one of which is trying to slurp down as many URLs as possible without wasting bytes. Not enough people know about Common Crawl, or know how to write multithreaded scrapers with high utilization across domains without suffocating any single one. If your scraper is a URL FIFO or stack in a loop, you're just DoSing one domain at a time.
1. The most successful scrapers avoid standing out in any way
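A sketch of the alternative: one queue per host plus a per-host "not before" timestamp, so the scraper stays busy across many domains while any single domain sees at most one request every few seconds (the interval is an arbitrary assumption):

    # per_host_scheduler.py - sketch: spread crawling across hosts instead of
    # draining one domain's URLs back to back.
    import time
    from collections import defaultdict, deque
    from urllib.parse import urlsplit

    PER_HOST_INTERVAL = 5.0                  # arbitrary: >= 5s between hits per host

    queues = defaultdict(deque)              # host -> URLs still to fetch
    not_before = defaultdict(float)          # host -> earliest next-request time

    def add(url):
        queues[urlsplit(url).netloc].append(url)

    def next_url():
        now = time.monotonic()
        for host, q in queues.items():
            if q and now >= not_before[host]:
                not_before[host] = now + PER_HOST_INTERVAL
                return q.popleft()
        return None                          # nothing is due yet: sleep, don't spin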
Ok, I am, right now.
It seems like there are two sides here that are talking past one another: "people will do X and you accept it if you do not actively prevent it, if you can" and "X is bad behavior that should be stopped and shouldn't be the burden of individuals to stop". As someone who leans to the latter, the former just sounds like restating the problem being complained about.
Ensuring basic security and robustness of a piece of software is simply not remotely comparable to countering the abuse these llm companies carry on.
But it's not even the point. And preventing SQL injections (through healthy programming practices) doesn't make things worse for any legitimate user either.
But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.
A healthy society relies on a cooperation between members. It relies on them accepting some rules that limits their behavior. Like we agreed not to kill others, and now I can go outside without weapons and anti-bullet defenses.
Currently we have at least three problems:
1) Companies have no issue with not providing sources and not linking back.
2) There are too many scrapers, even if they behaved, some site would struggle to handle all of them.
3) Scrapers go full throttle 24/7, expecting the sites to rate-limit them if they are going too fast. Hammer a site into the ground, just wait until it's back and hammer it again, grabbing what you can before it crashes once more.
There's no longer a sense of the internet being for all of us and that we need to make room for each other. Website / human generated content exists as a resource to be strip mined.
No, it's just background internet scanning noise
If you were writing a script to mass-scan the web for vulnerabilities, you would want to collect as many http endpoints as possible. JS files, regardless of whether they're commented out or not, are a great way to find endpoints in modern web applications.
If you were writing a scraper to collect source code to train LLMs on, I doubt you would care as much about a commented-out JS file. I'm not sure you'd even want to train on random low-quality JS served by websites. Anyone familiar with LLM training data collection who can comment on this?
Your framing is off because this notion of fairness or morality isn't something they concern themselves with. They're using violence because if they didn't, they would be allowing other entities to gain wealth and power at their expense. I don't think it's much more complex than that.
See how differently these same bytes are treated in the hands of Aaron Swartz vs OpenAI. One threatened to empower humanity at the expense of reducing profits for a few rich men, so he got crucified for it. The other is hoping to make humans redundant, concentrate the distribution of wealth even further, and strengthen the US world dominance, so all of the right wheels get greased for them and they get a license to kill - figuratively and literally.
Saying that a handful of mass copyright infringers with billion dollar investors are simply part of the "public" like every regular visitor is seriously distorting the issue here.
Sites with a robots.txt banning bots are only "unrestricted" in a strictly technical sense. They are clearly setting terms of use that these rogue bots are violating. Besides, robots.txt is legally binding in certain jurisdictions; it's not just a polite plea. And if we decide that anything not technically prevented is legal, then we're also legitimising botnets, DDoS attacks, and a lot more. Hacking into a corporate system through a misconfiguration or vulnerability is also illegal, despite the fact that the defenses failed.
Finally, we all know that the only purpose these bots are scraping for is mass copyright infringement. That's another layer where the "if it's accessible, it's fair game" logic falls apart. I can download a lot of publicly accessible art, music, or software, but that doesn't mean I can do with those files whatever I want. The only reason these AI companies haven't been sued out of existence yet, like they should've been, is that it's trickier to prove provenance than if they straight up served the unmodified files.
This is why there are so many libraries to make requests that look like they came from a browser, to work around buggy servers or server operators with wrong assumptions.
I suppose the ultimate solution would be browsers and operating systems and hardware manufacturers co-operating to implement some system that somehow cryptographically signs HTTP requests which attests that it was triggered by an actual, physical interaction with a computing device by a human.
Though you don't have to think for very long to come up with all kinds of collateral damage that would cause and how bad actors could circumvent it anyway.
All in all, this whole issue seems more like a legal problem than a technical one.
There are some states where the considerations for self defense do not include a duty to retreat if possible, either in general (“stand your ground" law) or specifically in the home (“castle doctrine"), but all the other requirements (imminent threat of certain kinds of serious harm, proportional force) for self-defense remain part of the law in those states, and trespassing by/while disregarding a ”no soliciting” would not, by itself, satisfy those requirements.
There's a difference between sending no data, and sending false data. I don't block requests without http referrers for that very reason.
tell me you've never heard of https://wttr.in/ without telling me. :P
It would absolutely be a bug iff this site returned html to curl.
> This is why there are so many libraries to make requests that look like they came from browser, to work around buggy servers or server operators with wrong assumptions.
This is a shallow take; the best counterexample is how Googlebot has no problem identifying itself both in and out of the user agent. Do note that user agent packing is distinctly different from a fake user agent selected randomly from a list of the most common ones.
The existence of many libraries with the intent to help conceal the truth about a request doesn't feel like proof that's what everyone should be doing. It feels more like proof that most people only want to serve traffic to browsers and real users. And it's the bots and scripts that are the fuckups.
The best I can come up with is the Tor Browser, which will reduce the number of bits of information it will return, but I don't consider that to be misleading. It's a custom build of Firefox, that discloses it is Firefox, and otherwise behaves exactly as I would expect Firefox to behave.
Do you have a specific / concrete example in mind? Or are you mistaking a feature from something other than a mainstream browser?
I would actually argue, it's not nearly the same type of misconfiguration. The reason scripts, which have never been a browser, who omit their real identity, are doing it, is to evade bot detection. The reason browsers pack their UA with so much legacy data, is because of misconfigured servers. The server owner wants to send data to users and their browsers, but through incompetence, they've made a mistake. Browsers adapted by including extra strings in the UA to account for the expectations of incorrectly configured servers. Extra strings being the critical part, Google bot's UA is an example of this being done correctly.
Google bot doesn't get blocked from my server primarily because it's a *very* well behaved bot. It sends a lot of requests, but it's very kind, and has never acted in a way that could overload my server. It respects robots.txt, and identifies itself multiple times.
Google bot doesn't get blocked, because it's a well behaved bot that eagerly follows the rules. I wouldn't underestimate how far that goes towards the reason it doesn't get blocked. Much more than the power gained by being google search.
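"Identifies itself multiple times" is verifiable, too: Google documents a reverse-then-forward DNS check for telling genuine Googlebot traffic from user-agent spoofers. A rough sketch (error handling kept minimal):

    # verify_googlebot.py - sketch of the reverse-then-forward DNS check that
    # Google documents for verifying real Googlebot traffic.
    import socket

    TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

    def is_real_googlebot(ip):
        try:
            host = socket.gethostbyaddr(ip)[0]              # reverse DNS
        except OSError:
            return False
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]   # forward-confirm
        except OSError:
            return False

    # A UA string claiming "Googlebot" from an IP that fails this check is just
    # another bot wearing a costume.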