I've submitted several complaints to AWS to get this traffic to stop; their typical follow-up is: We have engaged with our customer, and based on this engagement have determined that the reported activity does not require further action from AWS at this time.
I've tried various 4XX responses to see if the bot will back off, and I've tried 30X redirects (which it follows), all to no avail.
The traffic is hitting numbers that require me to re-negotiate my contract with CloudFlare and is otherwise a nuisance when reviewing analytics/logs.
I've considered redirecting the entirety of the traffic to the AWS abuse report page, but at this scale it's essentially a small DDoS network, and sending it anywhere could be considered abuse in itself.
Are there others who have had a similar experience?
A gzip bomb is good if the bot happens to be vulnerable, but even just slowing down their connection rate is often sufficient: waiting just 10 seconds before responding with your 404 is going to consume ~7,000 ports on their box (700 requests/second x 10 seconds), which should be enough to crash most Linux processes. (nginx + mod-http-echo is a really easy way to set this up.)
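If nginx isn't handy, something like this rough Python/aiohttp sketch does the same delayed-404 trick (the 10-second delay and the port are just placeholders):

    import asyncio
    from aiohttp import web

    async def slow_404(request):
        # Hold the connection open before answering; each in-flight request
        # ties up a socket on the crawler's side for the full delay.
        await asyncio.sleep(10)
        return web.Response(status=404, text="not found")

    app = web.Application()
    app.router.add_route("*", "/{tail:.*}", slow_404)
    web.run_app(app, port=8080)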
But since AWS considers this fine, I'd absolutely take the "redirecting the entirety of the traffic to the AWS abuse report page" approach. If they consider it abuse - great, they can go turn it off then. The bot could behave differently, but curl at least won't add a Referer header or similar when it is redirected, so the obvious culprit would appear to be their instance hosting the bot, not you.
Actually, I would find the biggest file I can that is hosted by Amazon itself (not another AWS customer) and redirect them to it. I bet they're hosting linux images somewhere. Besides being more annoying (and thus hopefully attention-getting) for Amazon, it should keep the bot busy for longer, reducing the amount of traffic hitting you.
If the bot doesn't eat files over a certain size, try to find something smaller or something that doesn't report the size in response to a HEAD request.
AWS has become rather large and bloated and does stupid things sometimes, but they do still respond when you get their lawyers involved.
The first demand letter from a lawyer will usually stop this. The great thing about suing big companies is that they have to show up. You have no contractual agreement that prevents suing; you're coming at this entirely from the outside.
The TikTok ByteDance / Bytespider bots were making millions of image requests from my site.
Over and over again and they would not stop.
I eventually got Cloudinary to block all the relevant user agents, and initially just totally blocked Singapore.
It’s very abusive on the part of these bot-running AI scraping companies!
If I hadn’t been using the kind and generous Cloudinary, I could have been stuck with some seriously expensive hosting bills!
Nowadays I just block all AI bots with Cloudflare and be done with it!
It's a reverse-proxy / load balancer with built-in firewall and automatic HTTPS. You will be able to easily block the annoying bots with rules (https://pingoo.io/docs/rules)
The problem with DDoS attacks is generally the asymmetry: it requires more resources to deal with the request than to make it. Cute attempts to get back at the attacker with various tarpits generally magnify this and make it hit even harder.
I was so pissed off that I setup a redirect rule for it to send them over to random porn sites. That actually stopped it.
This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. An intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, Markov-babble is added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
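The core trick is simple to sketch. Roughly (a toy illustration, not the actual project's code; the word list stands in for a real Markov generator):

    import asyncio, hashlib, random
    from aiohttp import web

    WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

    async def tarpit(request):
        # Seed the RNG from the path so every URL always renders the same "page".
        seed = int(hashlib.sha256(request.path.encode()).hexdigest(), 16)
        rng = random.Random(seed)
        links = " ".join(f'<a href="/{rng.getrandbits(64):x}">more</a>' for _ in range(20))
        babble = " ".join(rng.choice(WORDS) for _ in range(300))
        await asyncio.sleep(rng.uniform(2, 10))   # waste the crawler's time, not your CPU
        return web.Response(text=f"<html><body>{links}<p>{babble}</p></body></html>",
                            content_type="text/html")

    app = web.Application()
    app.router.add_route("GET", "/{tail:.*}", tarpit)
    web.run_app(app, port=8080)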
https://news.ycombinator.com/item?id=42725147
Is this a good solution??
Make it follow redirects to some kind of illegal website. Be creative, I guess.
The reasoning being that if you can get AWS to trigger security measures on their side, maybe AWS will shut down their whole account.
Depending on how the crawler is designed, this may or may not work. If they are using SQS with Lambda then it obviously won't, but it will still hurt them nevertheless, because the serverless functions will be running for longer (5-15 minutes).
Another technique that comes to mind is to try to force the client to upgrade the connection (e.g. to a websocket) and see what happens. Mostly it will fail, but even if the bot gets stalled for 30 seconds, that is a win.
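A raw-socket sketch is enough to test it (a strict client will reject the handshake since there's no Sec-WebSocket-Accept header, but the question is only whether it stalls first):

    import asyncio

    async def stall(reader, writer):
        await reader.readuntil(b"\r\n\r\n")   # swallow the request headers
        writer.write(b"HTTP/1.1 101 Switching Protocols\r\n"
                     b"Upgrade: websocket\r\n"
                     b"Connection: Upgrade\r\n\r\n")
        await writer.drain()
        await asyncio.sleep(30)               # then just sit on the open socket
        writer.close()

    async def main():
        server = await asyncio.start_server(stall, "0.0.0.0", 8080)
        await server.serve_forever()

    asyncio.run(main())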
https://github.com/0x48piraj/gz-bomb/blob/master/gz-bomb-ser...
This is from your own post, and is almost the best answer I know of.
I recommend you configure a Cloudflare WAF rule to block the bot - and then move on with your life.
Simply block the bot and move on with your life.
Wouldn't recommend Googling it. You either know or just take a guess.
It sounds like the bot operator is spending enough on AWS to withstand the current level of abuse reports.
If you really wanted to retaliate, you could try getting a subpoena or court order to force AWS to disclose the owners of that AWS instance.
I'd suggest taking a look into patterns and IP rotation (if any) and perhaps blocking IP CIDR at the web server level, if the range is short.
Why is a simple deny from 12.123.0.0/16 (Apache) not working for you?
301 response to a selection of very large files hosted by companies you don't like.
When their AWS instances start downloading 70,000 Windows ISOs in parallel, they might notice.
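The redirect itself is a one-liner in most stacks; e.g. a rough Python/aiohttp sketch, with the ISO URL as a stand-in for whatever huge file you pick:

    from aiohttp import web

    BIG_FILE = "https://example.com/very-large-download.iso"   # placeholder target

    async def bounce(request):
        # Permanent redirect: clients that follow it pull the big file
        # from someone else's bandwidth, not yours.
        raise web.HTTPMovedPermanently(location=BIG_FILE)

    app = web.Application()
    app.router.add_route("*", "/{tail:.*}", bounce)
    web.run_app(app, port=8080)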
Hard to do with Cloudflare, but you can also tarpit them. Accept the request and send a response one character at a time (make sure you uncork and flush buffers/etc), with a 30-second delay between characters.
700 requests/second with, say, 10 KB of headers/response. Sure is a shame your server is so slow.
Sometimes these crawlers are just poorly written, not malicious. Sometimes it’s both.
I would try a zip bomb next. I know there’s one that is 10 MB over the network and unzips to ~200TB.
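If you'd rather roll your own than grab one, a rough sketch in Python - note a plain gzip of zeros only manages roughly 1000:1, so the 10 MB -> ~200 TB ones rely on nested archives; this just turns ~10 MB on the wire into ~10 GB on their end:

    import gzip, io

    def make_gzip_bomb(uncompressed_gib=10):
        # ~10 GiB of zeros compresses to roughly 10 MB at level 9.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
            chunk = b"\x00" * (1024 * 1024)
            for _ in range(uncompressed_gib * 1024):
                gz.write(chunk)
        return buf.getvalue()

    BOMB = make_gzip_bomb()
    # Serve BOMB with the header "Content-Encoding: gzip"; a gzip-aware client
    # will try to inflate the whole thing when it fetches the page.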
What I'd do is block the AWS AP range at the edge (unless there's something else there that needs access to your site) - you can get regularly updated JSON-formatted lists around the internet - or have something match its fingerprint and send it heaps of garbage, like the zip bombs others have suggested. It could be a recursive "you're abusing my site - go away" or what-have-you. You could also do some kind of grey-listing, where you limit the speed to a crawl so that each connection just consumes crawler resources and gets little content. If they are tracking this, they'll see the performance issues and maybe adjust.
Similarly, you can also try delivering one byte every 10 or 30 seconds, or whatever keeps the client on the other end hanging around without hitting an internal timeout.
import asyncio, itertools

# one byte every 10 seconds, forever ("resp" is whatever streaming
# response object your framework hands you)
for char in itertools.cycle(b"FUCKOFF"):
    await resp.send(bytes([char]))
    await resp.flush()
    await asyncio.sleep(10)
# etc
In the SMTP years we called this tarpitting, IIRC.

But I’m not sure I understand your distinction. A scraper is a crawler regardless of whether it is “custom” or an off-the-shelf solution.
The author also said the bot identified itself as a crawler
> Mozilla/5.0 (compatible; crawler)
The first goatse I actually saw was in ASCII form, funnily enough.
"what if we make the bots go stealthy and indistinguishable from actual human requests?"
"Mission Accomplished"
It was pretty clear in our case that they were scraping our site to get our pricing data. Our master catalog had several million SKUs, priced dynamically based on availability, customer contracts, and other factors. And we tried to add some value to the product pages, with relevant recommendations for cross-sells, alternate choices, etc. This was pretty compute-intensive, and the volume of the scraping could amount to a DoS at times. Like, they could bury us in bursts of requests so quickly that our infrastructure couldn't spin up new virtual servers, and once we were buried, it was difficult to dig back out from under the load. We learned a lot during this period, including some very counterintuitive stuff about how approaches to queuing and prioritizing that sounded great on paper could actually have unintended effects that made such situations worse.
One strategy we talked about was that, rather than blocking the bad guys, we'd tag the incoming traffic. We couldn't do this with perfect accuracy, but the inaccuracy was such that we could at least ensure that it wasn't affecting real customers (because we could always know when it was a real, logged-in user). We realized that we could at least cache the data in the borderline cases so we wouldn't have to recalculate (it was a particularly stupid bot that was attacking us, re-requesting the same stuff many times over); from that it was a small step to see that we could at the same time add a random fudge factor into any numbers, hoping to get to a state where the data did our attacker more harm than good.
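Roughly the shape of it, as a hypothetical sketch (names and thresholds are invented for illustration, not what we actually ran):

    import random

    _bot_price_cache = {}   # sku -> fuzzed price served to suspected bot traffic

    def price_for(sku, real_price, looks_like_bot):
        if not looks_like_bot:
            return real_price          # logged-in / real customers always get the true number
        if sku not in _bot_price_cache:
            # Cache a slightly-wrong price: cheap to serve on repeat requests,
            # and worse than useless to whoever is scraping it.
            _bot_price_cache[sku] = round(real_price * random.uniform(0.93, 1.07), 2)
        return _bot_price_cache[sku]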
We wound up doing what the OP is now doing, working with CloudFlare to identify and mitigate "attacks" as rapidly as possible. But there's no doubt that it cost us a LOT, in terms of developer time, payments to CF, and customer dissatisfaction.
By the way, this was all the more frustrating because we had circumstantial evidence that the attacker was a service contracted by one of our competitors. And if they'd come straight to us to talk about it, we'd have been much happier (and I think they would have been as well) to offer an API through which they could get the catalog data easily and in a way where we don't have to spend all the compute on the value-added stuff we were doing for humans. But of course they'd never come to us, or even admit it if asked, so we were stuck. And while this was going, there was also a case in the courts that was discussed many times here on HN. It was a question about blocking access to public sites, and the consensus here was something like "if you're going to have a site on the web, then it's up to you to ensure that you can support any requests, and if you can't find a way to withstand DoS-level traffic, it's your own fault for having a bad design". So it's interesting today to see that attitudes have changed.
How did that happen, why? I feel like a lot of people here would not want to make the same mistake, so details would be very welcome.
As long as pages weren't being served and so there was never any case of requesting ads but never showing them, I don't understand why Ads would care?
Not ideal, but it seems to work against primitive bots.
The submission and the context are about what to do when current blocking doesn't work...
> The traffic is hitting numbers that require me to re-negotiate my contract with CloudFlare and is otherwise a nuisance when reviewing analytics/logs.
So you're able to show financial hardship
Assuming one trusts the user-agent in this case, one could reduce the reply traffic to them and avoid touching the disk or any applications, with something like this in nginx:
if ($http_user_agent ~ (crawler|some-other-bot)) {
    return 200 '\n\n\n\nBot quota exceeded, check back in 2150 years.\n\n\n\n';
}
There are other variables to look for to see if something is a bot, but such things should be very well tested: $http_accept_language, $http_sec_fetch_mode, etc.

I don't use CF but maybe they have a way to block the entire ASN for AWS on your account, assuming one does not need inbound connections from them. I just blackhole their CIDR blocks [1] but that won't help someone using a CDN.
Decades later, I'm still traumatized by goatse, so it'll have to be someone with more fortitude than me.
> had circumstantial evidence that the attacker was a service contracted by one of our competitors
> we'd have been much happier ... to offer an API through which they could get the catalog data easily
Why not feed them bad data?
As for trying to get them to stop, maybe redirect the bot to random IP:port combinations in a network that's less friendly to being scanned? I believe certain parts of DoD IP space tend not to look kindly upon attempts to scan them.
Depending on your setup, you could try to poison the bot's DNS for your domain. Send them the IP address of their local police force maybe.
My guess is that this is yet another AI scraper. There are others complaining about this bot online but all they seem to come up with is blocking the ASN in Cloudflare.
If there's no technical solution, I'd consider consulting a legal professional to see if you can get Amazon to take action. Lawyers are expensive, but so is a Cloudflare bill when they decide you need to be on the "enterprise" tier.
I wish AWS would curtail abuse from their networks. My hope is to build some tools to automate detection and reporting of this sort of abuse, so we can put the ball in AWS's court.
I've resorted to blocking entire AS routes to prevent it (fortunately I am mostly hosting US sites with US only residential audiences). I'm not sure who's behind it, but one of the later data centers is oxylabs, so they're probably involved somehow.
Even funnier, include the EICAR test string in the redirect to the cloud provider metadata. Maybe we could trip some automated compromise detection.
Sounds like the opposite of the [1] Slow Loris DDOS attack. Instead of attacking with slow connections, you’re defending with slow connections
[1] https://www.cloudflare.com/en-au/learning/ddos/ddos-attack-t...
Another idea is replying with large cookies and seeing if the bot saves them and replies with them (once again, to eat traffic)
The idea is to increase their egress to the point someone notices (the bill)
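A toy example of the cookie idea (aiohttp again; the sizes just keep each cookie under the usual 4 KB per-cookie limit):

    from aiohttp import web

    async def cookie_stuffer(request):
        resp = web.Response(status=404, text="not found")
        # If the client stores and replays these, every one of its future
        # requests hauls the extra kilobytes back over its own uplink.
        for i in range(5):
            resp.set_cookie(f"junk{i}", "x" * 3500)
        return resp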
You will have 7000 sockets (file descriptors), but that's much more manageable than 7000 ports.
I tend to be careful with residential or office IP ranges. But if it looks like a datacenter, it will be blocked, no second thoughts. Especially if it's a cloud provider that makes it too easy for customers to rotate IPs. Identify the ASN within which they're rotating their IPs, and block it. This is much more effective than blocking based on arbitrary CIDRs or geographical boundaries.
Unless you're running an API for developers, there's no legitimate (non-crawling) reason for someone to request your site from an AWS resource. Even less so for something like Huawei Cloud.
If your server returns different content when Google crawls it compared to when normal users visit, they might suspect that you are trying to game the system. And yes, they do check from multiple locations with non-Googlebot user agents.
I'm not sure if showing an error page also counts as returning different content, but I guess the problem could be exacerbated by any content you include in the error page unless you're careful with the response code. Definitely don't make it too friendly. Whitelist important business partners.
I used to run an X instance in the cloud that I would sometimes browse websites from. It sucked but it was also legitimate.
In fact, the ability to move to a different cloud on short notice is also part of the CAPTCHA, because large cloud-based botnets usually can't. They'd get instabanned if they tried to move their crawling boxes to something like DigitalOcean.
That is strictly less resource intensive than serving 200 and some challenge.
I wrote a quick-and-dirty program that reads the authoritative list of all AWS IP ranges from https://ip-ranges.amazonaws.com/ip-ranges.json (more about that URL at the blog post https://aws.amazon.com/blogs/aws/aws-ip-ranges-json/), and creates rules in Windows Firewall to simply block all of them. Granted, it was a sledgehammer, but it worked well enough.
Here's the README.md I wrote for the program, though I never got around to releasing the code: https://markdownpastebin.com/?id=22eadf6c608448a98b6643606d1...
It ran for some years as a scheduled task on a small handful of servers, but I'm not sure if it's still in use today or even works anymore. If there's enough interest I might consider publishing the code (or sharing it with someone who wants to pick up the mantle). Alternatively it wouldn't be hard for someone to recreate that effort.
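The core of it is only a few lines, though; a rough sketch of the same idea in Python (rule names and batch size are arbitrary, and netsh dislikes very long argument lists):

    import json, subprocess, urllib.request

    with urllib.request.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as r:
        data = json.load(r)
    prefixes = sorted({p["ip_prefix"] for p in data["prefixes"]})

    # Add the prefixes in batches; re-running this should first delete
    # the old BlockAWS-* rules.
    BATCH = 200
    for i in range(0, len(prefixes), BATCH):
        chunk = prefixes[i:i + BATCH]
        subprocess.run(["netsh", "advfirewall", "firewall", "add", "rule",
                        f"name=BlockAWS-{i // BATCH}", "dir=in", "action=block",
                        "remoteip=" + ",".join(chunk)], check=True)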
G'luck!
They have control of what goes on on their computers and they are responsible.