What happens to latency if service time is cut in half (2022)

(pveentjer.github.io)

130 points luu | 3 comments | 18 Jun 24 06:10 UTC


vitus ◴[18 Jun 24 14:57 UTC] No.40718554[source]▶

> a utilization of 0.90

> So we went from a latency of 1 second to 0.091 seconds which is an 11 times improvement.

There's your problem -- you should never allow unbounded queue growth at high utilization. Going from 80% to 90% utilization roughly doubles your average latency (with the usual M/M/1 formula, service time / (1 - utilization), that is 5x versus 10x the service time). We could similarly make this number arbitrarily large by pushing that utilization closer to 1, e.g. "We halved service time at 99.99% utilization and saw a 10000x improvement". But that's not interesting, as your users will complain that your service is unusable under heavy load long before you get to that point.
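
To make the arithmetic concrete, here is a rough sketch that assumes the M/M/1 mean-latency formula (service time divided by one minus utilization) and a fixed arrival rate, so halving the service time also halves the utilization. The 0.1 s service time is an assumption chosen to reproduce the quoted 1 second and 0.091 second figures; the function names are made up for illustration.

```python
def mm1_latency(service_time: float, utilization: float) -> float:
    """Mean time in system (queueing + service) for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


def speedup_from_halving(service_time: float, utilization: float) -> float:
    """Latency ratio before/after halving service time at a fixed arrival rate.

    With the arrival rate unchanged, utilization (arrival rate * service time)
    is halved as well.
    """
    before = mm1_latency(service_time, utilization)
    after = mm1_latency(service_time / 2, utilization / 2)
    return before / after


if __name__ == "__main__":
    for rho in (0.5, 0.8, 0.9, 0.99, 0.9999):
        print(f"utilization={rho}: speedup={speedup_from_halving(0.1, rho):.1f}x")
```

The speedup grows without bound as utilization approaches 1, which is why the 11x figure says more about the starting utilization than about the optimization itself.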

The typical "fix" is to add load shedding (e.g. based on queue length) combined with some intelligent backoff logic at the client (to reduce congestion), and call it a day. This has its own downsides, e.g. increased latency for everyone in cases of overload. Or, if your configured queue length is too large, you get bufferbloat.
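
A minimal sketch of that combination, not tied to any particular framework: a bounded queue that rejects new work once it is full, plus a client-side retry delay using capped exponential backoff with jitter. `SheddingQueue`, `backoff_delay`, and all the default values are hypothetical names and numbers picked for illustration.

```python
import random
from collections import deque


class SheddingQueue:
    """FIFO queue that sheds (rejects) requests once max_len is reached."""

    def __init__(self, max_len: int):
        self.max_len = max_len
        self._items = deque()
        self.shed = 0  # number of rejected requests

    def offer(self, request) -> bool:
        """Enqueue the request, or reject it if the queue is full."""
        if len(self._items) >= self.max_len:
            self.shed += 1
            return False
        self._items.append(request)
        return True

    def take(self):
        """Return the oldest queued request, or None if the queue is empty."""
        return self._items.popleft() if self._items else None


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng=random.random) -> float:
    """Client-side retry delay: capped exponential backoff with full jitter."""
    return rng() * min(cap, base * (2 ** attempt))
```

The `max_len` value is the knob the comment warns about: set it too high and requests sit in the queue long enough to be useless by the time they are served.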

(I have seen an argument for using LIFO instead of FIFO, which achieves much more predictable median performance at the expense of causing unbounded badness in the worst case. For this, your client needs to set deadlines, which it should probably be doing anyways.)
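
As a rough illustration of the LIFO-with-deadlines idea (a sketch under my own assumptions, not a reference to any specific system): the server always picks the newest request, and anything whose client deadline has already passed is dropped rather than served. `DeadlineLifo` and its methods are hypothetical names.

```python
class DeadlineLifo:
    """LIFO request queue that discards requests whose deadline has passed."""

    def __init__(self):
        self._stack = []
        self.expired = 0  # requests dropped because they were too old

    def push(self, request, deadline: float) -> None:
        """Queue a request together with the client's absolute deadline."""
        self._stack.append((deadline, request))

    def pop_live(self, now: float):
        """Return the newest request that can still meet its deadline.

        Expired requests encountered on the way are dropped. Returns None
        when nothing serviceable is left.
        """
        while self._stack:
            deadline, request = self._stack.pop()
            if deadline > now:
                return request
            self.expired += 1
        return None
```

Old requests at the bottom of the stack are the ones that suffer the unbounded worst case; the deadline check is what keeps the server from wasting time on them once the client has given up.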

replies(1): >>40719166 #

1. paulsutter ◴[18 Jun 24 15:44 UTC] No.40719166[source]▶

>>40718554 #

A counterintuitive idea would be to use the machines for two purposes.

- a high-priority, low-latency service run at arbitrarily low utilization, and

- a low-priority, high-utilization service like batch processing to keep overall use near 100%

replies(1): >>40719391 #

2. vitus ◴[18 Jun 24 16:05 UTC] No.40719391[source]▶

>>40719166 (TP) #

Why is this counterintuitive? You're describing scheduling -- your high priority service has the ability to preempt the low priority service and use more of your CPU (or whatever your bottleneck is) if there are pending requests.
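
A toy version of that scheduling, assuming a single worker and strict priorities: pending high-priority requests are always picked before batch work. This sketch only chooses between jobs, so it is non-preemptive; the batch work would need to be split into small chunks (or the OS scheduler used) to get real preemption. `PriorityScheduler`, `HIGH`, and `LOW` are hypothetical names.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # smaller number = served first


class PriorityScheduler:
    """Strict-priority job queue; FIFO order within the same priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, priority: int, job) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def next_job(self):
        """Return the most urgent pending job, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    s = PriorityScheduler()
    s.submit(LOW, "batch-1")
    s.submit(LOW, "batch-2")
    s.submit(HIGH, "request-1")
    while (job := s.next_job()) is not None:
        print(job)
```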

If you have a high-availability, latency-sensitive service that's processing incoming requests in realtime, you want it to be overprovisioned (for limiting queue lengths, the ability to soak up DoS before other systems kick in, etc). But that's wasteful if you're idle half the time. (Or generally, off-peak -- you might explore spinning up / down nodes dynamically, but you probably still want some baseline amount of capacity in each of your availability zones.)

Another dimension you could explore (if your problem space allows it) is varying the amount of work you perform depending on your current load -- for instance, you could imagine a video server that serves 4k at light load, then degrades to 1080p at moderate load, and then degrades even further to 720p at heavy load.
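
The video example could look something like this; the load thresholds (0.5 and 0.8) are invented for illustration, and `pick_resolution` is a hypothetical helper rather than part of any real video server.

```python
def pick_resolution(load: float) -> str:
    """Choose an output resolution from the current load (0.0 = idle, 1.0 = saturated).

    Thresholds are illustrative assumptions, not recommendations.
    """
    if load < 0.5:
        return "4k"      # light load: serve full quality
    if load < 0.8:
        return "1080p"   # moderate load: first degradation step
    return "720p"        # heavy load: cheapest tier


if __name__ == "__main__":
    for load in (0.2, 0.6, 0.95):
        print(load, pick_resolution(load))
```

In practice you would probably want some hysteresis around the thresholds so that clients do not flap between tiers when load hovers near a boundary.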

That said, it depends on your cost model. If you're dealing with bare metal (or you are the cloud) where your principal operating expenses are space and power, then you might be fine with servers idling if power is expensive. If depreciation of your servers drives your costs (or if you get charged the same amount regardless of your actual utilization), then you might want to run your servers hotter.

replies(1): >>40719912 #

3. angarg12 ◴[18 Jun 24 17:00 UTC] No.40719912[source]▶

>>40719391 #

If you have two workloads with very different operational profiles (e.g. low latency low volume vs high latency high volume) you could also process each workload in a different copy of the system, sorta-kinda the hot-path/cold-path pattern.

It might sound like duplicating your system would make it more complex, but having two systems each doing one very specific thing tends to be simpler than one single system doing very different things.
