kqr No.40715797
Another of those somewhat counter-intuitive results is the answer to the question "how much do we need to scale up to avoid response time regression when the request rate is doubled?"

It is very easy to blurt out "well obviously we need twice the processing power!" but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.

To many people that sounds weird; it sounds like we got something for nothing. If I invite twice as many people to a party and buy twice as many cookies, it's not like each guest will get twice as many cookies – that just leaves the originally planned number of cookies for each guest.

But for response time it comes back to the first equation in TFA:

    T = 1/μ · 1/(1 - ρ)
Doubling both the arrival rate λ and the maximum service rate μ leaves ρ = λ/μ – and with it the second factor – unchanged, but halves the 1/μ factor, resulting in half the response time.
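
To make that concrete, here is a minimal sketch plugging illustrative rates into that formula:

    # M/M/1 mean response time: T = 1/mu * 1/(1 - rho), with rho = lam/mu
    def response_time(lam, mu):
        rho = lam / mu
        assert rho < 1, "queue must be stable"
        return (1 / mu) * (1 / (1 - rho))

    t_before = response_time(80.0, 100.0)   # 0.050 (rho = 0.8)
    t_after = response_time(160.0, 200.0)   # 0.025 (rho still 0.8)
    print(t_before / t_after)               # 2.0 -- exactly half the time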

The appropriate amount to scale by is the k that solves the equation we get by setting the old response time T at request rate λ and processing power μ equal to the new one at request rate 2λ and processing power kμ. This is something like

    T = 1/μ · 1/(1 - λ/μ) = 1/(kμ) · 1/(1 - 2λ/(kμ))
but rearranges to the much simpler

    k = ρ + 1
which a colleague of mine pointed out can be interpreted intuitively as "the processing power that needs to be added is exactly that which will handle a second copy of the current utilisation on top of the current utilisation, i.e. twice the current load."
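
A quick numeric sanity check of that algebra, with illustrative rates:

    # Solve T(2*lam, k*mu) == T(lam, mu) for k, writing T = 1/(mu - lam)
    # (the same as 1/mu * 1/(1 - rho)).
    lam, mu = 30.0, 100.0    # rho = 0.3
    target = 1 / (mu - lam)  # current response time
    # New response time is 1/(k*mu - 2*lam); set equal to target, solve for k:
    k = (1 / target + 2 * lam) / mu
    print(k, lam / mu + 1)   # 1.3 and 1.3 -- matches k = rho + 1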

This is mostly good news for people doing capacity planning in advance of events etc. If you run your systems at reasonable utilisation levels normally, you don't actually need that much additional capacity to handle peak loads.
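
For instance, tabulating k = ρ + 1 at a few utilisation levels (a quick sketch):

    for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"at {rho:.0%} utilisation, scale by {rho + 1:.1f}x to absorb 2x traffic")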

krisoft No.40716860
> It is very easy to blurt out "well obviously we need twice the processing power!" but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.

I don't understand what you are saying. Are you talking about the average time a request sits buffered in some queue, assuming requests arrive randomly? Or something like that?

Here is what I'm thinking. We are operating a hospital which does only one type of surgery, which lasts exactly an hour. (Presumably it is a veterinary practice for spherical cows.) A fully staffed operating room can operate on 24 spherical cows a day. If, due to a spherical cow jamboree, we expect more patients and we set up a second operating theatre, we will be able to serve 48 of them a day. But we are still serving each of them for an hour. (Because that is how long the operation takes.)

Even if we are talking about latency: when 24 cows show up at the stroke of midnight to the one-operating-room hospital, each will be served in 12.5 hours on average. Same average if 48 cows show up to the two-operating-room hospital.
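
(That arithmetic in a quick sketch:)

    # 24 cows at midnight, one OR, one cow per hour: finish times are 1..24h.
    one_room = list(range(1, 25))
    print(sum(one_room) / len(one_room))    # 12.5 hours on average

    # 48 cows, two ORs: two cows finish at each of hours 1..24.
    two_rooms = [h for h in range(1, 25) for _ in range(2)]
    print(sum(two_rooms) / len(two_rooms))  # still 12.5 hours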

So what am I thinking wrong here? Is "scale to twice the processing power" not the same as getting a second operating room? I'm not seeing where "we will actually be serving each request in half the time" comes from.

lesuorac No.40716961
> So what am I thinking wrong here? Is "scale to twice the processing power" not the same as getting a second operating room? I'm not seeing where "we will actually be serving each request in half the time" comes from.

Single Core vs Multi Core (ish).

With a single thread you must work twice as fast to handle the increased load, which also means each unit of work is finished in half the time.

With multithreading you can push two units of work through at the same time, so it's twice the load at the same per-unit speed.

To go back to the cow analogy, rather than adding 24 rooms (more threads) you give each surgeon a powersaw and they work twice as fast.

So if you scale up the processing power per thread, the time goes down; if you scale the processing power by adding threads (cores), the time stays the same (ish).

krisoft No.40717086
> To go back to the cow analogy, rather than adding 24 rooms (more threads) you give each surgeon a powersaw and they work twice as fast.

I was thinking that maybe that is what we are talking about, but convinced myself otherwise. Surely we don't need this much math to show that if we cut the processing time in half, the processing time is cut in half. But if that is all we are saying, I guess I can accept it as quite trivial.

kqr No.40717597
Cut the processing time in half while doubling the load!

The average cow, pre-jamboree, spends one hour in the hospital, including time waiting for a slot in the OR. Then you give the surgeon a power saw that allows him to complete the job in half the time, but he also gets twice as many cows to work on.

Most people's intuition would tell them the cows would still spend an hour in the hospital (the doubling in work rate canceled by the doubling in work amount), but actually now it takes half an hour -- regardless of how swamped the surgeon was initially.
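
If you would rather see it without the formula, here is a minimal simulation sketch, assuming Poisson arrivals and exponential surgery times (the M/M/1 setting the formula describes; the rates are illustrative):

    import random

    def mean_time_in_system(lam, mu, n=200_000, seed=42):
        # Lindley recursion for a single-server FIFO queue:
        # wait(i+1) = max(0, wait(i) + service(i) - interarrival(i))
        rng = random.Random(seed)
        wait, total = 0.0, 0.0
        for _ in range(n):
            service = rng.expovariate(mu)
            total += wait + service
            wait = max(0.0, wait + service - rng.expovariate(lam))
        return total / n

    # Pre-jamboree: the average cow spends ~1 hour in hospital (rho = 0.5).
    print(mean_time_in_system(lam=1.0, mu=2.0))  # ~1.0
    # Power saw and twice the cows: both rates double, rho is still 0.5.
    print(mean_time_in_system(lam=2.0, mu=4.0))  # ~0.5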