252 points by lgats | 41 comments

I have been struggling with a bot – 'Mozilla/5.0 (compatible; crawler)', coming from AWS Singapore – that has been sending an absurd number of requests to a domain of mine, averaging over 700 requests/second for several months now. Thankfully, CloudFlare is able to handle the traffic with a simple WAF rule and a 444 response to reduce the outbound traffic.

I've submitted several complaints to AWS to get this traffic to stop. Their typical follow-up is: "We have engaged with our customer, and based on this engagement have determined that the reported activity does not require further action from AWS at this time."

I've tried various 4XX responses to see if the bot will back off, and I've tried 30X redirects (which it follows), all to no avail.

The traffic is hitting numbers that require me to renegotiate my contract with CloudFlare, and it's otherwise a nuisance when reviewing analytics/logs.

I've considered redirecting the entirety of the traffic to the AWS abuse report page, but at this scale it's essentially a small DDoS network, and sending it anywhere could be considered abuse in itself.

Have others had a similar experience?

1. swiftcoder ◴[] No.45614001[source]
Making the obviously-abusive bot prohibitively expensive is one way to go, if you control the terminating server.

A gzip bomb is good if the bot happens to be vulnerable, but even just slowing down their connection rate is often sufficient: waiting just 10 seconds before responding with your 404 is going to consume ~7,000 ports on their box (700 requests/second × a 10-second delay = ~7,000 concurrent connections), which should be enough to crash most Linux processes. nginx + mod-http-echo is a really easy way to set this up.
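
For illustration, here is a minimal sketch of the same delay-then-404 idea as a standalone Python (asyncio) server rather than the nginx + echo-module setup the comment mentions; the port and delay are placeholders.

    # tarpit.py -- hold each connection open for a while, then answer 404.
    # A sketch of the slow-response idea above; port and delay are placeholders.
    import asyncio

    DELAY_SECONDS = 10

    RESPONSE = (
        b"HTTP/1.1 404 Not Found\r\n"
        b"Content-Length: 0\r\n"
        b"Connection: close\r\n\r\n"
    )

    async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        try:
            # Read (and discard) the request headers.
            await reader.readuntil(b"\r\n\r\n")
        except (asyncio.IncompleteReadError, asyncio.LimitOverrunError):
            pass
        # The client keeps a socket (and, on its side, an ephemeral port) tied up here.
        await asyncio.sleep(DELAY_SECONDS)
        writer.write(RESPONSE)
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def main() -> None:
        server = await asyncio.start_server(handle, "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())

Note that the same ~7,000 connections stay open on the serving side too, so file-descriptor limits there need to be raised accordingly (as discussed further down in the thread).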

replies(7): >>45614138 #>>45614240 #>>45614367 #>>45614560 #>>45619426 #>>45623137 #>>45628852 #
2. Orochikaku ◴[] No.45614138[source]
Thinking along the same lines, a PoW check like Anubis [1] may work for OP as well.

[1] https://github.com/TecharoHQ/anubis

replies(2): >>45614636 #>>45626996 #
3. lagosfractal42 ◴[] No.45614240[source]
This kind of reasoning assumes the bot continues to be non-stealthy
replies(4): >>45614297 #>>45618883 #>>45619322 #>>45622067 #
4. swiftcoder ◴[] No.45614297[source]
I mean, forcing them to spend engineering effort to make their bot stealthy (or to be able to maintain tens of thousands of open ports) is still driving up their costs, so I'd count it as a win. The OP doesn't say why the bot is hitting their endpoints, but I doubt the bot is a profit centre for the operator.
replies(1): >>45615789 #
5. mkj ◴[] No.45614367[source]
AWS customers have to pay for outbound traffic. Is there a way to get them to send you (or cloudflare) huge volumes of traffic?
replies(2): >>45614423 #>>45614438 #
6. horseradish7k ◴[] No.45614423[source]
yeah, could use a free worker
replies(1): >>45623234 #
7. _pdp_ ◴[] No.45614438[source]
A KB-sized zip file can expand to gigabytes or even petabytes through recursive nesting – though it depends on their implementation.
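
As a rough illustration of the nesting trick (assuming the client actually recurses into archives, which is not a given), a sketch in Python; the payload size and nesting depth are arbitrary.

    # nested_zip.py -- wrap a highly compressible payload in several layers of zip.
    # A sketch only; it only hurts clients that recursively extract archives.
    import io
    import zipfile

    def nested_zip(payload: bytes, depth: int) -> bytes:
        data = payload
        name = "0.bin"
        for level in range(depth):
            buf = io.BytesIO()
            with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
                zf.writestr(name, data)  # each layer re-wraps the previous archive
            data = buf.getvalue()
            name = f"{level + 1}.zip"
        return data

    if __name__ == "__main__":
        # 10 MiB of zeros shrinks to roughly 10 KB in the first layer;
        # deeper layers mainly add nesting for recursive extractors.
        bomb = nested_zip(b"\0" * (10 * 1024 * 1024), depth=5)
        print(f"outer archive: {len(bomb)} bytes")
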
replies(1): >>45614913 #
8. gildas ◴[] No.45614560[source]
Great idea. Some people have already implemented it for the same type of need, it would seem (see the list of user agents in the source code). The implementation seems simple.

https://github.com/0x48piraj/gz-bomb/blob/master/gz-bomb-ser...
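
For reference, a hedged sketch of the same serve-a-gzip-bomb idea in Python (not the linked implementation); the user-agent string, port, and payload size are placeholders.

    # gz_bomb_server.py -- serve a pre-compressed gzip payload to a flagged user agent.
    # A sketch; user-agent string, port, and payload size are placeholders.
    import gzip
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    BAD_UA = "Mozilla/5.0 (compatible; crawler)"

    # ~100 MiB of zeros compresses to roughly 100 KB on the wire; the client only
    # feels it when it inflates the body in memory.
    PAYLOAD = gzip.compress(b"\0" * (100 * 1024 * 1024), compresslevel=9)

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if BAD_UA in self.headers.get("User-Agent", ""):
                self.send_response(200)
                self.send_header("Content-Encoding", "gzip")
                self.send_header("Content-Type", "text/html")
                self.send_header("Content-Length", str(len(PAYLOAD)))
                self.end_headers()
                self.wfile.write(PAYLOAD)
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()

Pre-compressing the payload once at startup keeps the per-request cost near zero; the expansion happens on the client.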

replies(1): >>45623447 #
9. hshdhdhehd ◴[] No.45614636[source]
Avoid it if you don't have to use it. It's not very friendly to legitimate traffic, especially if the current blocking works.
replies(2): >>45619737 #>>45621922 #
10. sim7c00 ◴[] No.45614913{3}[source]
That's traffic in the other direction.
replies(1): >>45616545 #
11. lagosfractal42 ◴[] No.45615789{3}[source]
You risk flagging real users as bots, which drives down your profits and reputation
replies(1): >>45615960 #
12. swiftcoder ◴[] No.45615960{4}[source]
In this case I don't think they do - unless the legitimate users are also hitting your site at 700 RPS (in which case, the added load from the bot is going to be negligible)
13. swiftcoder ◴[] No.45616545{4}[source]
The main joy of a zip bomb is that it doesn't consume much bandwidth - the transferred compressed file is relatively small, and it only becomes huge when the client tries to decompress it in memory afterwards
replies(1): >>45619421 #
14. heavyset_go ◴[] No.45618883[source]
If going stealth means not blatantly DDoS'ing the OP then that's a better outcome than what's currently happening
15. somat ◴[] No.45619322[source]
xkcd 810 comes to mind. https://xkcd.com/810/

"what if we make the bots go stealthy and indistinguishable from actual human requests?"

"Mission Accomplished"

replies(1): >>45621586 #
16. crazygringo ◴[] No.45619421{5}[source]
It's still going in the wrong direction.
replies(1): >>45619595 #
17. CWuestefeld ◴[] No.45619426[source]
We've been in a similar situation. One thing we considered doing is to give them bad data.

It was pretty clear in our case that they were scraping our site to get our pricing data. Our master catalog had several million SKUs, priced dynamically based on availability, customer contracts, and other factors. And we tried to add some value to the product pages, with relevant recommendations for cross-sells, alternate choices, etc. This was pretty compute-intensive, and the volume of the scraping could amount to a DoS at times. Like, they could bury us in bursts of requests so quickly that our infrastructure couldn't spin up new virtual servers, and once we were buried, it was difficult to dig back out from under the load. We learned a lot during this period, including some very counterintuitive stuff about how some approaches to queuing and prioritizing that sounded great on paper could actually have unintended effects that made such situations worse.

One strategy we talked about was that, rather than blocking the bad guys, we'd tag the incoming traffic. We couldn't do this with perfect accuracy, but the inaccuracy was such that we could at least ensure that it wasn't affecting real customers (because we could always know when it was a real, logged-in user). We realized that we could at least cache the data in the borderline cases so we wouldn't have to recalculate (it was a particularly stupid bot that was attacking us, re-requesting the same stuff many times over); from that it was a small step to see that we could at the same time add a random fudge factor into any numbers, hoping to get to a state where the data did our attacker more harm than good.
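
A tiny sketch of that tag-and-fudge idea; the scoring threshold and noise range here are invented purely for illustration.

    # fudge.py -- add noise to prices only for traffic scored as a likely scraper.
    # The suspicion threshold and noise range are invented for illustration.
    import random

    SUSPICION_THRESHOLD = 0.8  # above this, treat the request as bot-like

    def quoted_price(real_price: float, suspicion_score: float, logged_in: bool) -> float:
        # Logged-in customers always see real numbers, mirroring the safeguard
        # described above (real users can be identified with certainty).
        if logged_in or suspicion_score < SUSPICION_THRESHOLD:
            return real_price
        fudge = random.uniform(-0.07, 0.07)  # +/- 7% noise
        return round(real_price * (1 + fudge), 2)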

We wound up doing what the OP is now doing, working with CloudFlare to identify and mitigate "attacks" as rapidly as possible. But there's no doubt that it cost us a LOT, in terms of developer time, payments to CF, and customer dissatisfaction.

By the way, this was all the more frustrating because we had circumstantial evidence that the attacker was a service contracted by one of our competitors. And if they'd come straight to us to talk about it, we'd have been much happier (and I think they would have been as well) to offer an API through which they could get the catalog data easily and in a way where we didn't have to spend all the compute on the value-added stuff we were doing for humans. But of course they'd never come to us, or even admit it if asked, so we were stuck. And while this was going on, there was also a case in the courts that was discussed many times here on HN. It was a question about blocking access to public sites, and the consensus here was something like "if you're going to have a site on the web, then it's up to you to ensure that you can support any requests, and if you can't find a way to withstand DoS-level traffic, it's your own fault for having a bad design". So it's interesting today to see that attitudes have changed.

replies(1): >>45620366 #
18. dns_snek ◴[] No.45619595{6}[source]
It doesn't matter either way. OP was thinking about ways to consume someone's bandwidth. A zip bomb doesn't consume bandwidth, it consumes computing resources of its recipient when they try to unpack it.
replies(2): >>45620056 #>>45625962 #
19. CaptainOfCoit ◴[] No.45619737{3}[source]
> Especially if current blocking works.

The submission and its context are about when the current blocking doesn't work...

replies(1): >>45625113 #
20. crazygringo ◴[] No.45620056{7}[source]
I know. I was pointing out that it doesn't matter what it consumes if it's going the wrong way to begin with.
21. gwbas1c ◴[] No.45620366[source]
> rather than blocking the bad guys, we'd tag the incoming traffic

> had circumstantial evidence that the attacker was a service contracted by one of our competitors

> we'd have been much happier ... to offer an API through which they could get the catalog data easily

Why not feed them bad data?

replies(1): >>45627381 #
22. HPsquared ◴[] No.45621586{3}[source]
This has pretty much happened now in the internet at large, and it's kinda sad.
replies(1): >>45624041 #
23. ◴[] No.45621922{3}[source]
24. lucastech ◴[] No.45622067[source]
Yeah, there are some botnets I've been seeing that are much more stealthy, using 900-3000 IPs with rotating user agents to send enormous amounts of traffic.

I've resorted to blocking entire AS routes to prevent it (fortunately I am mostly hosting US sites with US only residential audiences). I'm not sure who's behind it, but one of the later data centers is oxylabs, so they're probably involved somehow.

https://wxp.io/blog/the-bots-that-keep-on-giving
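
A sketch of that AS-level blocking approach, assuming RIPEstat's announced-prefixes endpoint as the prefix source and nftables on the receiving end; the ASN and set name are placeholders.

    # asn_block.py -- turn an AS number into nftables "add element" commands.
    # Assumes the RIPEstat announced-prefixes endpoint and its JSON layout;
    # the ASN and set name below are placeholders.
    import json
    import sys
    import urllib.request

    RIPESTAT = "https://stat.ripe.net/data/announced-prefixes/data.json?resource=AS{asn}"

    def announced_prefixes(asn: int) -> list[str]:
        with urllib.request.urlopen(RIPESTAT.format(asn=asn), timeout=30) as resp:
            data = json.load(resp)
        return [p["prefix"] for p in data["data"]["prefixes"]]

    if __name__ == "__main__":
        asn = int(sys.argv[1]) if len(sys.argv) > 1 else 64496  # placeholder ASN
        for prefix in announced_prefixes(asn):
            if ":" not in prefix:  # IPv4 only, for this sketch
                print(f"nft add element inet filter blocked_v4 {{ {prefix} }}")

The printed commands assume a set created beforehand with something like nft add set inet filter blocked_v4 '{ type ipv4_addr; flags interval; }'.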

25. kristianp ◴[] No.45623137[source]
Stupid question, won't that consume 7000 ports on your own box as well?
replies(3): >>45623360 #>>45623415 #>>45627908 #
26. compootr ◴[] No.45623234{3}[source]
free workers only get 100k reqs per day or something
27. Neywiny ◴[] No.45623360[source]
I think it'll eat 7000 connection objects, maybe threads, but they'll all be on port 80 or 443? So if you can keep the overhead of each connection down, presumably easy because you don't need it to be fast, it'll be fine
replies(1): >>45623574 #
28. kijin ◴[] No.45623415[source]
Each TCP connection requires a unique combination of (server port, client port). Your server port is fixed: 80 or 443. They need to use a new ephemeral port for each connection.

You will have 7000 sockets (file descriptors), but that's much more manageable than 7000 ports.

29. kijin ◴[] No.45623447[source]
Be careful using this if you're behind cloudflare. You might inadvertently bomb your closest ally in the battle.
30. ◴[] No.45623574{3}[source]
31. lotsofpulp ◴[] No.45624041{4}[source]
“Constructive” and “Helpful” are unfortunately now outweighed by garbage.
32. hshdhdhehd ◴[] No.45625113{4}[source]
> Thankfully, CloudFlare is able to handle the traffic with a simple WAF rule and 444 response to reduce the outbound traffic.

That is strictly less resource intensive than serving 200 and some challenge.

replies(1): >>45626544 #
33. sim7c00 ◴[] No.45625962{7}[source]
I wouldn't assume someone sending ~700 requests per second to a single domain repeatedly (likely to the same resources) will bother opening zip files.

The bot in the submission is likely being tested (as the author noted), or it's a very bad 'stresser'.

If it were after content it would access things differently (grab resources once and be on its way).

It's not bad to host zip bombs though, for the content grabbers :D nomnom.

Saw an article on here about a guy who generated arbitrary PNGs or so. Also classy, haha.

If you have a friendly VPS provider who gives unlimited bandwidth, these options can be fun. You can make a dashboard showing which bot has consumed the most junk.

replies(2): >>45626139 #>>45626681 #
34. ruined ◴[] No.45626139{8}[source]
Nearly every HTTP response is gzipped. Unpacking it automatically is a default feature of every HTTP client.
35. CaptainOfCoit ◴[] No.45626544{5}[source]
Right, but if you re-read the submission, OP already tried that, found the costs to be potentially too high, and is looking for alternatives...
36. mjmas ◴[] No.45626681{8}[source]
This is using the built-in compression in HTTP:

  Content-Encoding: gzip
37. winnie_ua ◴[] No.45626996[source]
It was blocking me from accessing GNOME's gitlab instance from my cell phone.

So it mistakenly flagged me as a bot. IDK. And it forces legitimate users to wait a while. Not great UX.

38. CWuestefeld ◴[] No.45627381{3}[source]
We didn't like the ethics of it, especially since we couldn't guarantee that the bogus data was going only to the attacker (rather than to innocent but not-yet-authenticated "general public").
replies(1): >>45629020 #
39. swiftcoder ◴[] No.45627908[source]
7000 sockets, at any rate, but provided you've anticipated the need, this isn't challenging to support (and nginx is very good at handling large numbers of open sockets)
40. SergeAx ◴[] No.45628852[source]
Wouldn't it consume the same number of connections on my server?
41. IshKebab ◴[] No.45629020{4}[source]
I guess you could have required a login to show prices for suspicious requests. Then it shouldn't affect most people, and if it accidentally does, the worst outcome is that they need to log in.