I've submitted several complaints to AWS to get this traffic to stop; their typical follow-up is: We have engaged with our customer, and based on this engagement have determined that the reported activity does not require further action from AWS at this time.
I've tried various 4XX responses to see if the bot will back off, and I've tried 30X redirects (which it follows), all to no avail.
The traffic is hitting numbers that require me to re-negotiate my contract with CloudFlare and is otherwise a nuisance when reviewing analytics/logs.
I've considered redirecting the entirety of the traffic to the AWS abuse report page, but at this scale it's essentially a small DDoS network, and sending it anywhere could be considered abuse in itself.
Are there others who have had a similar experience?
A gzip bomb is good if the bot happens to be vulnerable, but even just slowing down their connection rate is often sufficient: waiting just 10 seconds before responding with your 404 is going to consume ~7,000 ports on their box (700 requests/second x 10 seconds), which should be enough to crash most Linux processes. (nginx + mod-http-echo is a really easy way to set this up.)
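If nginx isn't handy, something like this rough Python/aiohttp sketch does the same delayed-404 trick (the 10-second delay and the port are just placeholders):

    import asyncio
    from aiohttp import web

    async def slow_404(request):
        # Hold the connection open before answering; each in-flight request
        # ties up a socket on the crawler's side for the full delay.
        await asyncio.sleep(10)
        return web.Response(status=404, text="not found")

    app = web.Application()
    app.router.add_route("*", "/{tail:.*}", slow_404)
    web.run_app(app, port=8080)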
But since AWS considers this fine, I'd absolutely take the "redirecting the entirety of the traffic to the AWS abuse report page" approach. If they consider it abuse - great, they can go turn it off then. The bot could behave differently, but curl at least won't add a Referer header or similar when it is redirected, so the obvious culprit would appear to be their instance hosting the bot, not you.
Actually, I would find the biggest file I can that is hosted by Amazon itself (not another AWS customer) and redirect them to it. I bet they're hosting linux images somewhere. Besides being more annoying (and thus hopefully attention-getting) for Amazon, it should keep the bot busy for longer, reducing the amount of traffic hitting you.
If the bot doesn't eat files over a certain size, try to find something smaller or something that doesn't report the size in response to a HEAD request.
AWS has become rather large and bloated and does stupid things sometimes, but they do still respond when you get their lawyers involved.
The first demand letter from a lawyer will usually stop this. The great thing about suing big companies is that they have to show up. You have no contractual agreement that prevents suing; you're coming at this entirely from the outside.
The TikTok ByteDance / Bytespider bots were making millions of image requests from my site.
Over and over again and they would not stop.
I eventually got Cloudinary to block all the relevant user agents, and initially just totally blocked Singapore.
It’s very abusive on the part of these bot-running AI scraping companies!
If I hadn’t been using the kind and generous Cloudinary, I could have been stuck with some seriously expensive hosting bills!
Nowadays I just block all AI bots with Cloudflare and be done with it!
It's a reverse-proxy / load balancer with built-in firewall and automatic HTTPS. You will be able to easily block the annoying bots with rules (https://pingoo.io/docs/rules)
The problem with DDoS attacks is generally the asymmetry: it requires more resources to deal with the request than to make it. Cute attempts to get back at the attacker with various tarpits generally magnify this and make it hit even harder.
I was so pissed off that I setup a redirect rule for it to send them over to random porn sites. That actually stopped it.
This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. An intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, Markov-babble is added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
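The core trick is simple to sketch. Roughly (a toy illustration, not the actual project's code; the word list stands in for a real Markov generator):

    import asyncio, hashlib, random
    from aiohttp import web

    WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

    async def tarpit(request):
        # Seed the RNG from the path so every URL always renders the same "page".
        seed = int(hashlib.sha256(request.path.encode()).hexdigest(), 16)
        rng = random.Random(seed)
        links = " ".join(f'<a href="/{rng.getrandbits(64):x}">more</a>' for _ in range(20))
        babble = " ".join(rng.choice(WORDS) for _ in range(300))
        await asyncio.sleep(rng.uniform(2, 10))   # waste the crawler's time, not your CPU
        return web.Response(text=f"<html><body>{links}<p>{babble}</p></body></html>",
                            content_type="text/html")

    app = web.Application()
    app.router.add_route("GET", "/{tail:.*}", tarpit)
    web.run_app(app, port=8080)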
https://news.ycombinator.com/item?id=42725147
Is this a good solution??
Make it follow redirects to some kind of illegal website. Be creative, I guess.
The reasoning being that if you can get AWS to trigger security measures on their side, maybe AWS will shut down their whole account.
Depending on how the crawler is designed, this may or may not work. If they are using SQS with Lambda then it obviously won't, but it will still hurt them nevertheless, because the serverless functions will be running for longer (5-15 minutes).
Another technique that comes to mind is to try to force the client to upgrade the connection (e.g. to a websocket) and see what happens. Mostly it will fail, but even if the bot gets stalled for 30 seconds, that is a win.
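A raw-socket sketch is enough to test it (a strict client will reject the handshake since there's no Sec-WebSocket-Accept header, but the question is only whether it stalls first):

    import asyncio

    async def stall(reader, writer):
        await reader.readuntil(b"\r\n\r\n")   # swallow the request headers
        writer.write(b"HTTP/1.1 101 Switching Protocols\r\n"
                     b"Upgrade: websocket\r\n"
                     b"Connection: Upgrade\r\n\r\n")
        await writer.drain()
        await asyncio.sleep(30)               # then just sit on the open socket
        writer.close()

    async def main():
        server = await asyncio.start_server(stall, "0.0.0.0", 8080)
        await server.serve_forever()

    asyncio.run(main())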
https://github.com/0x48piraj/gz-bomb/blob/master/gz-bomb-ser...
This is from your own post, and is almost the best answer I know of.
I recommend you configure a Cloudflare WAF rule to block the bot - and then move on with your life.
Simply block the bot and move on with your life.
Wouldn't recommend Googling it. You either know or just take a guess.
It sounds like the bot operator is spending enough on AWS to withstand the current level of abuse reports.
If you really wanted to retaliate, you could try getting a subpoena or court order to force AWS to disclose the owners of that AWS instance.
I'd suggest taking a look into patterns and IP rotation (if any) and perhaps blocking IP CIDR at the web server level, if the range is short.
Why is a simple deny from 12.123.0.0/16 (Apache) not working for you?
301 response to a selection of very large files hosted by companies you don't like.
When their AWS instances start downloading 70,000 Windows ISOs in parallel, they might notice.
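The redirect itself is a one-liner in most stacks; e.g. a rough Python/aiohttp sketch, with the ISO URL as a stand-in for whatever huge file you pick:

    from aiohttp import web

    BIG_FILE = "https://example.com/very-large-download.iso"   # placeholder target

    async def bounce(request):
        # Permanent redirect: clients that follow it pull the big file
        # from someone else's bandwidth, not yours.
        raise web.HTTPMovedPermanently(location=BIG_FILE)

    app = web.Application()
    app.router.add_route("*", "/{tail:.*}", bounce)
    web.run_app(app, port=8080)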
Hard to do with Cloudflare, but you can also tarpit them. Accept the request and send a response one character at a time (make sure you uncork and flush buffers/etc), with a 30-second delay between characters.
700 requests/second with, say, 10 KB of headers/response. Sure is a shame your server is so slow.
Sometimes these crawlers are just poorly written, not malicious. Sometimes it’s both.
I would try a zip bomb next. I know there’s one that is 10 MB over the network and unzips to ~200TB.
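If you'd rather roll your own than grab one, a rough sketch in Python - note a plain gzip of zeros only manages roughly 1000:1, so the 10 MB -> ~200 TB ones rely on nested archives; this just turns ~10 MB on the wire into ~10 GB on their end:

    import gzip, io

    def make_gzip_bomb(uncompressed_gib=10):
        # ~10 GiB of zeros compresses to roughly 10 MB at level 9.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
            chunk = b"\x00" * (1024 * 1024)
            for _ in range(uncompressed_gib * 1024):
                gz.write(chunk)
        return buf.getvalue()

    BOMB = make_gzip_bomb()
    # Serve BOMB with the header "Content-Encoding: gzip"; a gzip-aware client
    # will try to inflate the whole thing when it fetches the page.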
What I'd do is block the AWS AP range at the edge (unless there's something else there that needs access to your site) - you can get regularly updated JSON-formatted lists around the internet - or have something match its fingerprint and send it heaps of garbage, like the zip bombs others have suggested. It could be a recursive "you're abusing my site - go away" or what-have-you. You could also do some kind of grey-listing, where you limit the speed to a crawl so that each connection just consumes crawler resources and gets little content. If they are tracking this, they'll see the performance issues and maybe adjust.
Similarly, you can also try delivering one byte every 10 or 30 seconds, or whatever keeps the client on the other end hanging around without hitting an internal timeout.
import asyncio, itertools

# one byte every 10 seconds, forever ("resp" is whatever streaming
# response object your framework hands you)
for char in itertools.cycle(b"FUCKOFF"):
    await resp.send(bytes([char]))
    await resp.flush()
    await asyncio.sleep(10)
# etc
In the SMTP years we called this tarpitting, IIRC.

But I’m not sure I understand your distinction. A scraper is a crawler regardless of whether it is “custom” or an off-the-shelf solution.
The author also said the bot identified itself as a crawler
> Mozilla/5.0 (compatible; crawler)
The first goatse I actually saw was in ASCII form, funnily enough.
"what if we make the bots go stealthy and indistinguishable from actual human requests?"
"Mission Accomplished"
It was pretty clear in our case that they were scraping our site to get our pricing data. Our master catalog had several million SKUs, priced dynamically based on availability, customer contracts, and other factors. And we tried to add some value to the product pages, with relevant recommendations for cross-sells, alternate choices, etc. This was pretty compute-intensive, and the volume of the scraping could amount to a DoS at times. Like, they could bury us in bursts of requests so quickly that our infrastructure couldn't spin up new virtual servers, and once we were buried, it was difficult to dig back out from under the load. We learned a lot during this period, including some very counterintuitive stuff about how approaches to queuing and prioritizing that sounded great on paper could actually have unintended effects that made such situations worse.
One strategy we talked about was that, rather than blocking the bad guys, we'd tag the incoming traffic. We couldn't do this with perfect accuracy, but the inaccuracy was such that we could at least ensure that it wasn't affecting real customers (because we could always know when it was a real, logged-in user). We realized that we could at least cache the data in the borderline cases so we wouldn't have to recalculate (it was a particularly stupid bot that was attacking us, re-requesting the same stuff many times over); from that it was a small step to see that we could at the same time add a random fudge factor into any numbers, hoping to get to a state where the data did our attacker more harm than good.
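Roughly the shape of it, as a hypothetical sketch (names and thresholds are invented for illustration, not what we actually ran):

    import random

    _bot_price_cache = {}   # sku -> fuzzed price served to suspected bot traffic

    def price_for(sku, real_price, looks_like_bot):
        if not looks_like_bot:
            return real_price          # logged-in / real customers always get the true number
        if sku not in _bot_price_cache:
            # Cache a slightly-wrong price: cheap to serve on repeat requests,
            # and worse than useless to whoever is scraping it.
            _bot_price_cache[sku] = round(real_price * random.uniform(0.93, 1.07), 2)
        return _bot_price_cache[sku]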
We wound up doing what the OP is now doing, working with CloudFlare to identify and mitigate "attacks" as rapidly as possible. But there's no doubt that it cost us a LOT, in terms of developer time, payments to CF, and customer dissatisfaction.
By the way, this was all the more frustrating because we had circumstantial evidence that the attacker was a service contracted by one of our competitors. And if they'd come straight to us to talk about it, we'd have been much happier (and I think they would have been as well) to offer an API through which they could get the catalog data easily and in a way where we don't have to spend all the compute on the value-added stuff we were doing for humans. But of course they'd never come to us, or even admit it if asked, so we were stuck. And while this was going, there was also a case in the courts that was discussed many times here on HN. It was a question about blocking access to public sites, and the consensus here was something like "if you're going to have a site on the web, then it's up to you to ensure that you can support any requests, and if you can't find a way to withstand DoS-level traffic, it's your own fault for having a bad design". So it's interesting today to see that attitudes have changed.
How did that happen, why? I feel like a lot of people here would not want to make the same mistake, so details would be very welcome.
As long as pages weren't being served and so there was never any case of requesting ads but never showing them, I don't understand why Ads would care?
Not ideal, but it seems to work against primitive bots.
The submission and the context are about what to do when current blocking doesn't work...
> The traffic is hitting numbers that require me to re-negotiate my contract with CloudFlare and is otherwise a nuisance when reviewing analytics/logs.
So you're able to show financial hardship
Assuming one trusts the user-agent in this case, one could reduce the reply traffic to them and avoid touching the disk or any applications, with something like this in nginx:
if ($http_user_agent ~ (crawler|some-other-bot)) {
    return 200 '\n\n\n\nBot quota exceeded, check back in 2150 years.\n\n\n\n';
}
There are other variables to look for to see if something is a bot, but such things should be very well tested: $http_accept_language, $http_sec_fetch_mode, etc.

I don't use CF but maybe they have a way to block the entire ASN for AWS on your account, assuming one does not need inbound connections from them. I just blackhole their CIDR blocks [1] but that won't help someone using a CDN.
Decades later, I'm still traumatized by goatse, so it'll have to be someone with more fortitude than me.
> had circumstantial evidence that the attacker was a service contracted by one of our competitors
> we'd have been much happier ... to offer an API through which they could get the catalog data easily
Why not feed them bad data?
As for trying to get them to stop, maybe redirect the bot to random IP:port combinations in a network that's less friendly to being scanned? I believe certain parts of DoD IP space tend not to look kindly upon attempts to scan them.
Depending on your setup, you could try to poison the bot's DNS for your domain. Send them the IP address of their local police force maybe.
My guess is that this is yet another AI scraper. There are others complaining about this bot online but all they seem to come up with is blocking the ASN in Cloudflare.
If there's no technical solution, I'd consider consulting a legal professional to see if you can get Amazon to take action. Lawyers are expensive, but so is a Cloudflare bill when they decide you need to be on the "enterprise" tier.
I wish AWS would curtail abuse from their networks. My hope is to build some tools to automate detection and reporting of this sort of abuse, so we can put the ball in AWS's court.
I've resorted to blocking entire AS routes to prevent it (fortunately I am mostly hosting US sites with US only residential audiences). I'm not sure who's behind it, but one of the later data centers is oxylabs, so they're probably involved somehow.
Even funnier, include the EICAR test string in the redirect to the cloud provider metadata. Maybe we could trip some automated compromise detection.
Sounds like the opposite of the [1] Slow Loris DDOS attack. Instead of attacking with slow connections, you’re defending with slow connections
[1] https://www.cloudflare.com/en-au/learning/ddos/ddos-attack-t...
Another idea is replying with large cookies and seeing if the bot saves them and replies with them (once again, to eat traffic)
The idea is to increase their egress to the point someone notices (the bill)
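A toy example of the cookie idea (aiohttp again; the sizes just keep each cookie under the usual 4 KB per-cookie limit):

    from aiohttp import web

    async def cookie_stuffer(request):
        resp = web.Response(status=404, text="not found")
        # If the client stores and replays these, every one of its future
        # requests hauls the extra kilobytes back over its own uplink.
        for i in range(5):
            resp.set_cookie(f"junk{i}", "x" * 3500)
        return resp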
You will have 7000 sockets (file descriptors), but that's much more manageable than 7000 ports.
I tend to be careful with residential or office IP ranges. But if it looks like a datacenter, it will be blocked, no second thoughts. Especially if it's a cloud provider that makes it too easy for customers to rotate IPs. Identify the ASN within which they're rotating their IPs, and block it. This is much more effective than blocking based on arbitrary CIDRs or geographical boundaries.
Unless you're running an API for developers, there's no legitimate (non-crawling) reason for someone to request your site from an AWS resource. Even less so for something like Huawei Cloud.
If your server returns different content when Google crawls it compared to when normal users visit, they might suspect that you are trying to game the system. And yes, they do check from multiple locations with non-Googlebot user agents.
I'm not sure if showing an error page also counts as returning different content, but I guess the problem could be exacerbated by any content you include in the error page unless you're careful with the response code. Definitely don't make it too friendly. Whitelist important business partners.
I used to run an X instance in the cloud that I would sometimes browse websites from. It sucked but it was also legitimate.
In fact, the ability to move to a different cloud on short notice is also part of the CAPTCHA, because large cloud-based botnets usually can't. They'd get instabanned if they tried to move their crawling boxes to something like DigitalOcean.
That is strictly less resource intensive than serving 200 and some challenge.
I wrote a quick-and-dirty program that reads the authoritative list of all AWS IP ranges from https://ip-ranges.amazonaws.com/ip-ranges.json (more about that URL at the blog post https://aws.amazon.com/blogs/aws/aws-ip-ranges-json/), and creates rules in Windows Firewall to simply block all of them. Granted, it was a sledgehammer, but it worked well enough.
Here's the README.md I wrote for the program, though I never got around to releasing the code: https://markdownpastebin.com/?id=22eadf6c608448a98b6643606d1...
It ran for some years as a scheduled task on a small handful of servers, but I'm not sure if it's still in use today or even works anymore. If there's enough interest I might consider publishing the code (or sharing it with someone who wants to pick up the mantle). Alternatively it wouldn't be hard for someone to recreate that effort.
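The core of it is only a few lines, though; a rough sketch of the same idea in Python (rule names and batch size are arbitrary, and netsh dislikes very long argument lists):

    import json, subprocess, urllib.request

    with urllib.request.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as r:
        data = json.load(r)
    prefixes = sorted({p["ip_prefix"] for p in data["prefixes"]})

    # Add the prefixes in batches; re-running this should first delete
    # the old BlockAWS-* rules.
    BATCH = 200
    for i in range(0, len(prefixes), BATCH):
        chunk = prefixes[i:i + BATCH]
        subprocess.run(["netsh", "advfirewall", "firewall", "add", "rule",
                        f"name=BlockAWS-{i // BATCH}", "dir=in", "action=block",
                        "remoteip=" + ",".join(chunk)], check=True)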
G'luck!
They have control of what goes on on their computers and they are responsible.