kqr ◴[] No.40715797[source]
Another of those somewhat counter-intuitive results is the answer to the question "how much do we need to scale up to avoid response time regression when the request rate is doubled?"

It is very easy to blurt out "well obviously we need twice the processing power!" but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.

To many people that sounds weird; it sounds like we got something for nothing. If I invite twice as many people to a party and buy twice as many cookies, it's not like each guest will get twice as many cookies – that just leaves the originally planned number of cookies for each guest.

But for response time it comes back to the first equation in TFA:

    T = 1/μ · 1/(1 - ρ)
Doubling both the arrival rate λ and the maximum service rate μ leaves ρ = λ/μ unchanged, and the second factor with it, but still halves the 1/μ factor, resulting in half the response time.
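
A quick way to convince yourself is to plug numbers into that formula. A minimal Python sketch, with arrival and service rates made up for illustration (not taken from the article):

    # M/M/1 mean response time: T = 1/mu * 1/(1 - rho), with rho = lambda/mu.
    def response_time(lam, mu):
        rho = lam / mu
        assert rho < 1, "queue is unstable at rho >= 1"
        return (1 / mu) * (1 / (1 - rho))

    lam, mu = 50.0, 100.0                        # requests/s arriving, requests/s capacity
    t_before = response_time(lam, mu)            # 0.02 s
    t_after  = response_time(2 * lam, 2 * mu)    # 0.01 s
    print(t_after / t_before)                    # exactly 0.5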

The appropriate amount to scale by is the k that solves the equation we get when we set the old response time T at request rate λ equal to the one at request rate 2λ and kμ processing power. This is something like

    T = 1/μ · 1/(1 - λ/μ) = 1/kμ · 1/(1 - 2λ/kμ)
The left-hand side simplifies to 1/(μ - λ) and the right-hand side to 1/(kμ - 2λ), so μ - λ = kμ - 2λ, which rearranges to the much simpler

    k = ρ + 1
which, a colleague of mine told me, can be interpreted intuitively as "the processing power that needs to be added is exactly that which will handle an additional unit of the current utilisation on top of the current utilisation, i.e. twice as much."
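
As a sanity check on that k, here is a small sketch using the closed form of the same M/M/1 formula, T = 1/(μ - λ); the rates are again invented for illustration.

    # Check that scaling capacity by k = rho + 1 absorbs a doubled arrival rate.
    lam, mu = 30.0, 100.0
    rho = lam / mu
    k = rho + 1                          # claimed scaling factor, here 1.3
    t_old = 1 / (mu - lam)               # response time at (lam, mu): 1/70 s
    t_new = 1 / (k * mu - 2 * lam)       # response time at (2*lam, k*mu): also 1/70 s
    print(k, t_old, t_new)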

This is mostly good news for people doing capacity planning ahead of events and the like. If you run your systems at reasonable utilisation levels normally, you don't actually need that much additional capacity to handle peak loads: at ρ = 0.5, for example, k = 1.5, so 50% extra capacity absorbs a doubled request rate with no response time regression.

replies(6): >>40716157 #>>40716287 #>>40716860 #>>40717110 #>>40717333 #>>40719853 #
mgerdts ◴[] No.40717110[source]
> but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.

Not necessarily. If processing power is increased by doubling the clock of a processor or using a hard disk that spins and seeks twice as fast, this may be the case.

But we all know that when you have a single-threaded work item, adding a second core does not make that work complete in half the time. If the arrival rate is substantially lower than 1/S, the second core will be of negligible value, and maybe of negative value due to synchronization overhead. That overhead is unlikely to be seen when doubling from 1 to 2, but is more likely at high levels of scaling.

If the processor is saturated, service time includes queue time, and service time is dominated by queue time, then doubling the number of processors may allow service time to be cut almost in half. How close to half depends on the ratio of queue time to processing time.

replies(1): >>40717725 #
kqr ◴[] No.40717725[source]
This is a useful addition. Yes, the above reasoning was assuming that we

(a) had located a bottleneck,

(b) were planning to vertically scale the capacity of that bottleneck, and

(c) in doing so won't run into a different bottleneck!

It is still useful for many cases of horizontal scaling because, from sufficiently far away, c identical servers look a lot like a single server with c times the capacity. Many applications I encounter in practice do not require one to be very far away to make that simplifying assumption.
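
To get a feel for how good that approximation is, here is a rough sketch (my own example rates and a textbook Erlang C formula, nothing from the thread) comparing one M/M/1 server at c·μ against an M/M/c cluster of c servers at μ each:

    from math import factorial

    def mm1_response(lam, mu):
        # Single fast server: T = 1/(mu - lam)
        return 1 / (mu - lam)

    def mmc_response(lam, mu, c):
        # c servers at rate mu each; Erlang C gives the probability of queueing.
        a = lam / mu                     # offered load in Erlangs
        rho = a / c
        assert rho < 1, "unstable"
        tail = (a**c / factorial(c)) / (1 - rho)
        p_wait = tail / (sum(a**k / factorial(k) for k in range(c)) + tail)
        return p_wait / (c * mu - lam) + 1 / mu   # queueing time + service time

    lam, mu, c = 140.0, 100.0, 2         # per-server utilisation 0.7
    print(mm1_response(lam, c * mu))     # ~0.0167 s
    print(mmc_response(lam, mu, c))      # ~0.0196 s

The two numbers get closer as utilisation rises, since queueing time dominates both models; at low utilisation the single fast server wins by roughly a factor of c, because each request is served c times faster.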

> How close to half depends on the ratio of queue time to processing time.

I don't think this is generally true, but I think I see what you're going for: you're using queue length as a proxy for how far away we have to stand before horizontal scaling becomes a good stand-in for vertical scaling, right?