Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system

(www.tomshardware.com)

521 points hd4 | 1 comments | 20 Oct 25 12:31 UTC | HN request time: 0s | source

Paper: https://dl.acm.org/doi/10.1145/3731569.3764815

Show context

kilotaras ◴[20 Oct 25 15:05 UTC] No.45644776[source]▶

Alibaba Cloud claims to reduce Nvidia GPU used for serving unpopular models by 82% (emphasis mine)

> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found

Instead of 1192 GPUs they now use 213 for serving those requests.

replies(5): >>45645037 #>>45647752 #>>45647863 #>>45651559 #>>45653363 #

bee_rider ◴[20 Oct 25 19:06 UTC] No.45647863[source]▶

>>45644776 #

I’m slightly confuse as to how all this works. Do the GPUs just sit there with the models on them when the models are not in use?

I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?

(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)

replies(6): >>45648058 #>>45648291 #>>45648653 #>>45649219 #>>45650208 #>>45653517 #

svachalek ◴[20 Oct 25 20:53 UTC] No.45649219[source]▶

>>45647863 #

Models take a lot of VRAM which is tightly coupled to the GPU so yeah, it's basically sitting there with the model waiting for use. I'm sure they probably do idle out but a few minutes of idle time is a lot of waste--possibly the full 82% mentioned. In this case they optimized by letting the GPUs load multiple models and sharing the load out by token.

replies(2): >>45650833 #>>45651174 #

jychang ◴[21 Oct 25 00:39 UTC] No.45651174[source]▶

>>45649219 #

They definitely won't idle out- if they idle out, it'll take on the order of up to 60 seconds to load the model back into VRAM, depending on the model.

That's an eternity for a request. I highly doubt they will timeout any model they serve.

replies(5): >>45651578 #>>45651815 #>>45652636 #>>45653451 #>>45656541 #

godelski ◴[21 Oct 25 05:05 UTC] No.45652636{3}[source]▶

>>45651174 #

  > That's an eternity for a request. I highly doubt they will timeout any model they serve.

That's what easing functions are for.

Let's say 10 GPUs are in use. You keep another 3 with the model loaded. If demand increases slowly you slowly increase your headroom. If demand increases rapidly, you also increase rapidly.

The correct way to do this is more complicated and you should model based on your usage history, but if you have sufficient headroom then very few should be left idle. Remember that these models do requests in batches.

If they don't timeout models, they're throwing money down the drain. Though that wouldn't be uncommon.

replies(2): >>45652744 #>>45654780 #

1. jgalt212 ◴[21 Oct 25 11:59 UTC] No.45654780{4}[source]▶

>>45652636 #

The thundering herd breaks this scheme.

↑