The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.
In that regard, reading my logs has sometimes led me to interesting articles about cyber security. Also, log flooding can cause your journaling service to truncate the log, so you miss something important.
If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.
Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.
The bonus is my actual customers get the same benefits and don't notice any material loss from my content _not_ being scraped. How you see this as me being secretly taken advantage of is completely beyond me.
That's not much for any modern server so I genuinely don't understand the frustration. I'm pretty certain gitea should be able to handle thousands of read requests per minute (not per hour) without even breaking a sweat.
I wonder what all those people are doing that their server can't handle the traffic. Wouldn't a simple IP-based rate limit be sufficient? I only pay $1 per month for my VPS, and even that piece of trash can handle 1000s of requests per second.
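For what it's worth, a per-IP limit is only a handful of lines in front of most app servers. Here's a rough sketch in Go using golang.org/x/time/rate; the numbers and the IP keying are just illustrative, not anything gitea ships with:

    package main

    import (
        "net"
        "net/http"
        "sync"

        "golang.org/x/time/rate"
    )

    // perIPLimiter hands out one token bucket per client IP.
    type perIPLimiter struct {
        mu      sync.Mutex
        buckets map[string]*rate.Limiter
    }

    func (p *perIPLimiter) get(ip string) *rate.Limiter {
        p.mu.Lock()
        defer p.mu.Unlock()
        l, ok := p.buckets[ip]
        if !ok {
            // ~5 requests/second with a burst of 20 per IP; illustrative numbers.
            l = rate.NewLimiter(5, 20)
            p.buckets[ip] = l
        }
        return l
    }

    // limit rejects over-budget requests with a 429 before they reach the app.
    func (p *perIPLimiter) limit(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ip, _, err := net.SplitHostPort(r.RemoteAddr)
            if err != nil {
                ip = r.RemoteAddr
            }
            if !p.get(ip).Allow() {
                w.Header().Set("Retry-After", "10")
                http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        limiter := &perIPLimiter{buckets: map[string]*rate.Limiter{}}
        // Whatever actually serves the site would be registered on DefaultServeMux.
        http.ListenAndServe(":8080", limiter.limit(http.DefaultServeMux))
    }

In practice you'd put this (or the equivalent reverse-proxy config) in front of the app rather than inside it, but the idea is the same.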
Yeah, this is beyond irresponsible. You know the moment you're pwned, __you__ become the new interesting story?
For everyone else, use a password manager to pick a random password for everything.
Depends on the computational cost per request. If you're serving static content from memory, 10k/s sounds easy. If you constantly have to calculate diffs across ranges of commits, I imagine a couple dozen can bring your box down.
Also: who's your webhost? $1/m sounds like a steal.
plaintextPassword = POST["password"]
ok = bcryptCompare(hashedPassword, plaintextPassword)
// (now throw away POST and plaintextPassword)
if (ok) { ... }
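If it helps, here's roughly the same flow as runnable Go using golang.org/x/crypto/bcrypt; findUser is a made-up lookup helper, and the sketch also folds in the dummy-hash compare from the bonus note just below:

    package login

    import "golang.org/x/crypto/bcrypt"

    // findUser is a stand-in for whatever user lookup the app does; it returns
    // the stored bcrypt hash and whether the user exists.
    func findUser(username string) (hashedPassword string, found bool) {
        return "", false // placeholder
    }

    // dummyHash is what gets compared when the user is unknown, so unknown and
    // known usernames take roughly the same time (partial timing-attack mitigation).
    var dummyHash, _ = bcrypt.GenerateFromPassword([]byte("not-a-real-password"), bcrypt.DefaultCost)

    func checkLogin(username, plaintextPassword string) bool {
        hashed, found := findUser(username)
        if !found {
            // Burn the same bcrypt cost as a real comparison, then reject.
            _ = bcrypt.CompareHashAndPassword(dummyHash, []byte(plaintextPassword))
            return false
        }
        // CompareHashAndPassword returns nil only on a match.
        return bcrypt.CompareHashAndPassword([]byte(hashed), []byte(plaintextPassword)) == nil
    }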
Bonus points: on user lookup, when no user is found, fetch a dummy hashedPassword, compare, and ignore the result. This will partially mitigate username enumeration via timing attacks.

I encountered exactly one actual problem: the temporary folder for zip snapshots filled up the disk, since bots followed all the snapshot links and it seems gitea doesn't delete generated snapshots. I made that directory read-only, deleted its contents, and the problem was solved, at the cost of only breaking zip snapshots.
I experienced no other problems.
I did put some user-agent checks in place a while later, but that was just for fun to see if AI would eventually ingest false information.
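Something along these lines, sketched in Go; the bot substrings and the decoy handler are placeholders for illustration, not what I actually ran:

    package uafilter

    import (
        "net/http"
        "strings"
    )

    // botSubstrings is purely illustrative; real lists change constantly.
    var botSubstrings = []string{"GPTBot", "CCBot", "Bytespider"}

    // withDecoy serves decoy content to matching user agents and the real
    // handler to everyone else.
    func withDecoy(realHandler, decoy http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            for _, s := range botSubstrings {
                if strings.Contains(r.UserAgent(), s) {
                    decoy.ServeHTTP(w, r)
                    return
                }
            }
            realHandler.ServeHTTP(w, r)
        })
    }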
Serving up a page that takes a few dozen db queries is a lot different than serving a static page.
if you put your server up on the public internet then this is just table stakes stuff that you always need to deal with; it doesn't really matter whether the traffic is from botnets or crawlers or AI systems or anything else
you're always gonna deal with this stuff well before the requests ever get to your application, with WAFs or reverse proxies or (idk) fail2ban or whatever else
also 1000 req/hour is around 1 request every 4 seconds, which is statistically 0 rps for any endpoint that would ever be publicly accessible
So unless you're not logging your request path/query string, you're doing something very, very wrong by your own logic :). I can't imagine diagnosing issues with web requests without being given the path + query string. You can diagnose without it, but you're sure not making things easier.
Until AI crawlers chased me off of the web, I ran a couple of fairly popular websites. I just so rarely see anybody including passwords in the URLs anymore that I didn't really consider that as what the commenter was talking about.
There are attackers out there that send SIP/2.0 OPTIONS requests to the GOPHER port, over TCP.
Background scanner noise on the internet is incredibly common, but the AI scraping is not at the same level. Wikipedia has published that their infrastructure costs have notably shot up since LLMs started scraping them. I've seen similar idiotic behavior on a small wiki I run; a single AI company took the data usage from "who gives a crap" to "this is approaching the point where I'm not willing to pay to keep this site up." Businesses can "just" pass the costs onto the customers (which is pretty shit at the end of the day,) but a lot of privately run and open source sites are now having to deal with side crap that isn't relevant to their focus.
The botnets and DDOS groups that are doing mass scanning and testing are targeted by law enforcement and eventually (hopefully) taken down, because what they're doing is acknowledged as bad.
AI companies, however, are trying to make a profit off of this bad behavior and we're expected to be okay with it? At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
One time, sure. But unauthenticated requests would surely be cached, authenticated ones skip the cache (just like HN works :) ), as most internet-facing websites end up using this pattern.
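A rough sketch of that pattern in Go, kept deliberately naive (in-memory map, a placeholder "session" cookie name; real setups usually push this into the reverse proxy or CDN):

    package anoncache

    import (
        "net/http"
        "net/http/httptest"
        "sync"
        "time"
    )

    type entry struct {
        body    []byte
        expires time.Time
    }

    var (
        mu    sync.Mutex
        cache = map[string]entry{}
    )

    // cacheAnonymous caches successful GET responses for requests without a
    // session cookie; logged-in users always hit the backend. Response headers
    // are dropped on cache hits to keep the sketch short.
    func cacheAnonymous(next http.Handler, ttl time.Duration) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // "session" is a placeholder cookie name.
            if _, err := r.Cookie("session"); err == nil || r.Method != http.MethodGet {
                next.ServeHTTP(w, r)
                return
            }
            key := r.URL.String()
            mu.Lock()
            e, ok := cache[key]
            mu.Unlock()
            if ok && time.Now().Before(e.expires) {
                w.Write(e.body)
                return
            }
            // Miss: run the real handler against a recorder, then store the body.
            rec := httptest.NewRecorder()
            next.ServeHTTP(rec, r)
            body := rec.Body.Bytes()
            if rec.Code == http.StatusOK {
                mu.Lock()
                cache[key] = entry{body: append([]byte(nil), body...), expires: time.Now().Add(ttl)}
                mu.Unlock()
            }
            w.WriteHeader(rec.Code)
            w.Write(body)
        })
    }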
but yeah the issue is that as long as you have something accessible to the public, it's ultimately your responsibility to deal with malicious/aggressive traffic
> At some point impacting my services with your business behavior goes from "it's just the internet being the internet" to willfully malicious.
I think maybe the current AI scraper traffic patterns are actually what "the internet being the internet" is from here forward
The problem I ran into was that performance was bimodal. We had this one group of users that was lightning fast and the rest were far slower. I chased down a few obvious outliers (that one forum thread with 11,000 replies that some guy leaves open in a browser tab all the time, etc.), but it was still bimodal. Eventually I just changed the application-level code to display known bots as one performance trace and everything else as another.
60% of all requests were known bots. This doesn't even count the random-ass bot that some guy started up at an ISP. Yes, this really happened: we were a paying customer of a company that decided to just conduct a DoS attack on us at 2 PM one afternoon. It took down the website.
Not only that, the bots effectively always got a cached response since they all seemed to love to hammer the same pages. Users never got a cached response, since LRU cache eviction meant the actual discussions with real users were always evicted. There were bots that would just rescrape every page they had ever seen every few minutes. There were bots that would just increase their throughput until the backend app would start to slow down.
There were bots that would run the javascript for whatever insane reason and start emulating users submitting forms, etc.
You probably are thinking "but you got to appear in a search index so it is worth it". Not really. Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times. Also we had an employee who was responsible for categorizing our organic search performance. While we had a huge amount of traffic from organic search, it was something like 40% to just one URL.
Retrospectively I'm now aware that a bunch of this was early stage AI companies scraping the internet for data.
It blocks a lot of bots, but I feel like just running on a high port number (10,000+) would likely do better.
Google has invested decades of core research with an army of PhDs into its crawler, particularly around figuring out when to recrawl a page. For example (a bit dated, but you can follow the refs if you're interested):
https://www.niss.org/sites/default/files/Tassone_interface6....
Saying “just cache this” is not sustainable. And this is only one repository; the only reasonable way to deal with this is some sort of traffic mitigation. You cannot just treat the traffic as the happy path.
You're absolutely right. That's my mistake — you are requesting a specific version of WordPress, but I had written a Rails app. I've rewritten the app as a WordPress plugin and deployed it. Let me know if there's anything else I can do for you.
We also had a period where we generated bad URLs for a week or two, and the worst part was I think they were on links marked nofollow. Three years later there was a bot still trying to load those pages.
And if you 429 Google’s bots they will reduce your pagerank. That’s straight up extortion from a company that also sells cloud services.
I don’t agree with you about Google being well behaved. They were following nofollow links, and they’re also terrible if you’re serving content on vanity URLs. Any throttling they do on one domain name just hits two more.
if i'm understanding you correctly, you had an indexable page that contained links with the nofollow attribute on the <a> tags.
It's possible some other mechanism got those URLs into the crawler like a person visiting them? Nofollow on the link won't prevent the URL from being crawled or indexed. If you're returning a 404 for them, you ought to be able to use webmaster tools or whatever it's called now, to request removal.
They were meant to be interactive URLs on search pages. Someone implemented them, I think trying to make a11y work, but the bots were slamming us. We also weren't doing canonical URLs right on the destination page, so they got searched again every scan cycle. So at least three dumb things were going on, but the sorts of mistakes that normal people could make.
I don't know that LLMs read sites. I only know that when I use one, it tells me it's checking sites X, Y, Z, thinking about the results, checking sites A, B, C, etc. I assumed it was actually reading the sites on my behalf and not just referring to its internal training knowledge.
Like, how are people training LLMs, and how often does each one scrape? From the outside, it feels like the big ones (ChatGPT, Gemini, Claude, etc.) scrape only a few times a year at most.
(One other thing is that the "tell me without telling me" thing is an internet trope and the site guidelines ask people to avoid those - they tend to make for unsubstantive comments, plus they're repetitive and we're trying to avoid that here. But I just mention this for completeness - it's secondary to the other point.)
I'd just add one other thing: there's one word in your post here which packs a huge amount of meaning, and that's "seemed" (as in "seemed to be coming from a place [etc.]"). I can't tell you how often it happens that what seems one way to one user—even when the "seems" seems overwhelmingly likely, as in near-impossible that it could be any other way—turns out to simply be mistaken, or at least to seem quite opposite to the other person. It's thousands of times easier to make a mistake in this way than people realize; and unfortunately the cost can be quite high when that happens because the other person often feels indignant ("how dare you assume that I [etc.]").
In the present case, I don't know anything about the experience level of the user who posted https://news.ycombinator.com/item?id=45011628, but https://news.ycombinator.com/item?id=45011442 was definitely posted by someone who has managed heavy-duty web facing services, and that comment says more or less the same thing as the other one.
btw you don't get dropped if you issue temporary 429s, only when it's consistent and/or the site is broken. that is well documented. and wtf else are they supposed to do if you don't allow them to crawl it and it goes stale?
Also to be clear I doubt those big guys are doing these crawls. I assume it's small startups who think they're gonna build a big dataset to sell or to train their own model.
Six of one, .008 of a dozen of the other.
Also, they might share the common viewpoint of "it's the internet; suck it up."
Kinda my point was that it's only the internet being the internet if we tolerate it. If enough people give a crap, the corporations doing it will have to knock it off.
if you wanna rage against the machine then more power to you but this line of thinking is dead on arrival in terms of outcome