
252 points by lgats | 2 comments

I have been struggling with a bot, 'Mozilla/5.0 (compatible; crawler)', coming from AWS Singapore, that has been sending an absurd number of requests to a domain of mine, averaging over 700 requests/second for several months now. Thankfully, Cloudflare is able to handle the traffic: a simple WAF rule with a 444 response keeps the outbound traffic down.
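For reference, the matching condition for such a rule might look like this in Cloudflare's rules expression language (a sketch, assuming the bot's user agent string stays stable and the current field names; pair it with a Block action or a custom response):

```
http.user_agent eq "Mozilla/5.0 (compatible; crawler)"
```

Matching on the full UA string is brittle if the bot ever changes it; a `contains "compatible; crawler"` match is a looser alternative.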

I've submitted several complaints to AWS to get this traffic to stop. Their typical follow-up is: "We have engaged with our customer, and based on this engagement have determined that the reported activity does not require further action from AWS at this time."

I've tried various 4XX responses to see if the bot will back off, and I've tried 30X redirects (which it follows), all to no avail.

The traffic is hitting numbers that require me to re-negotiate my contract with CloudFlare and is otherwise a nuisance when reviewing analytics/logs.

I've considered redirecting the entirety of the traffic to the AWS abuse report page, but at this scale it's essentially a small DDoS network, and sending it anywhere could be considered abuse in itself.

Have others had a similar experience?

n_u ◴[] No.45618867[source]
Dumb question, but since I didn't see it mentioned: have you tried a Disallow: / in your robots.txt? Or Crawl-delay: 10? That would be the first thing I'd try.
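For concreteness, a robots.txt combining both directives would look something like this (note Crawl-delay is a non-standard extension that many crawlers ignore, and a bot is free to ignore the whole file):

```
User-agent: *
Disallow: /
Crawl-delay: 10
```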

Sometimes these crawlers are just poorly written, not malicious. Sometimes it's both.

I would try a zip bomb next. I know there's one that is 10 MB over the network and unzips to ~200 TB.
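A crude single-layer version can be sketched in Python with only the standard library. This tops out near DEFLATE's ~1032:1 compression ceiling; the famous 10 MB → ~200 TB payloads rely on overlapping DEFLATE blocks or nested archives, which gzip alone cannot produce:

```python
import gzip
import io

# Sketch of a single-layer gzip "bomb": a long run of null bytes
# compresses at close to DEFLATE's ~1032:1 ceiling, so ~100 MB of
# zeros becomes roughly 100 KB on the wire.
RAW_SIZE = 100 * 1024 * 1024  # size once decompressed (adjust to taste)

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as f:
    f.write(b"\x00" * RAW_SIZE)
payload = buf.getvalue()  # serve with header: Content-Encoding: gzip

print(f"{len(payload)} bytes on the wire -> {RAW_SIZE} bytes inflated")
```

A naive client that honors Content-Encoding: gzip will inflate the whole thing in memory; a well-behaved browser is unaffected beyond wasted bandwidth.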

replies(1): >>45618990 #
1. pknerd ◴[] No.45618990[source]
robots.txt is for crawlers, not custom scrapers
replies(1): >>45619157 #
2. n_u ◴[] No.45619157[source]
Respecting robots.txt is a convention, not enforced by anything, so yes, the bot is certainly free to ignore it.

But I'm not sure I understand your distinction. A scraper is a crawler regardless of whether it is "custom" or an off-the-shelf solution.

The author also said the bot identified itself as a crawler:

> Mozilla/5.0 (compatible; crawler)