
521 points hd4 | 25 comments
kilotaras ◴[] No.45644776[source]
Alibaba Cloud claims to reduce the number of Nvidia GPUs used for serving unpopular models by 82% (emphasis mine)

> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found

Instead of 1192 GPUs they now use 213 for serving those requests.

replies(5): >>45645037 #>>45647752 #>>45647863 #>>45651559 #>>45653363 #
1. bee_rider ◴[] No.45647863[source]
I’m slightly confused as to how all this works. Do the GPUs just sit there with the models on them when the models are not in use?

I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?

(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)

replies(6): >>45648058 #>>45648291 #>>45648653 #>>45649219 #>>45650208 #>>45653517 #
2. make3 ◴[] No.45648058[source]
The models are huge, so no single latest-gen model fits on a single GPU.

It's likely that these are small, unpopular (non-flagship) models, or that they only pack, e.g., one layer of each model per GPU.

replies(1): >>45649253 #
3. smallnix ◴[] No.45648291[source]
> I guess I’d assumed this sort of thing would be allocated dynamically

At the scale of a hyperscaler, I think Alibaba is the one that would be doing that dynamic allocation. AWS, Azure and (I assume) Alibaba do lease/rent data centers, but someone has to own the servers / GPU racks. I know there are specialized companies like nscale (and others further down the chain) in the mix, but I always assumed they only lease out fixed capacity.

4. yorwba ◴[] No.45648653[source]
The paper is about techniques to do that dynamic allocation in a way that maximizes utilization without incurring unacceptable latencies. If you let a GPU sit idle for several minutes after serving a single request, you're setting money on fire, so they reuse it for a different model as soon as possible, starting even before the first request is finished. But if you don't have a dedicated GPU for a model, are you going to wait for a multi-gigabyte transfer before each request? So instead they dedicate a GPU (or two: one for prefill, one for decode) to a group of models that are processed in an interleaved fashion, scheduled such that they all stay within the latency budget.
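As a toy sketch of that feasibility check (this is not the paper's actual algorithm, and all numbers below are made up): a group of models can share one decode GPU only if interleaving one token step of each still keeps every model inside its own latency budget.

  from dataclasses import dataclass

  @dataclass
  class Model:
      name: str
      step_ms: float   # time to decode one token batch for this model
      swap_ms: float   # time to switch this model's weights/KV state in
      slo_ms: float    # per-token latency budget promised for this model

  def fits_on_one_gpu(group: list[Model]) -> bool:
      # One round serves one token step of every model in the group;
      # each model then waits at most one full round between its own steps.
      round_ms = sum(m.step_ms + m.swap_ms for m in group)
      return all(round_ms <= m.slo_ms for m in group)

  group = [
      Model("model-a", step_ms=25, swap_ms=40, slo_ms=300),
      Model("model-b", step_ms=45, swap_ms=70, slo_ms=400),
  ]
  print(fits_on_one_gpu(group))  # True: a 180ms round fits both budgets
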
5. svachalek ◴[] No.45649219[source]
Models take a lot of VRAM, which is tightly coupled to the GPU, so yeah, it's basically sitting there with the model loaded, waiting for use. They probably do idle out, but a few minutes of idle time is a lot of waste, possibly the full 82% mentioned. In this case they optimized by letting the GPUs load multiple models and sharing the load out by token.
replies(2): >>45650833 #>>45651174 #
6. svachalek ◴[] No.45649253[source]
Per the very short article, the solution was to pack multiple models per GPU.
replies(1): >>45650502 #
7. citizenpaul ◴[] No.45650208[source]
>Do the GPUs just sit there with the models on them when the models are not in use

I've assumed that as well. It makes sense to me, since loading up a model locally takes a while. I wonder if there is some better way I'm not in the know about. That, or I'm too GPU-poor to know about it.

8. make3 ◴[] No.45650502{3}[source]
Yes, but that could mean as little as one layer per model on each GPU.
9. andy_ppp ◴[] No.45650833[source]
How does this work with anything but trivially small context sizes!?
replies(1): >>45651181 #
10. jychang ◴[] No.45651174[source]
They definitely won't idle out; if they did, it would take up to 60 seconds, depending on the model, to load the weights back into VRAM.

That's an eternity for a request. I highly doubt they will time out any model they serve.

replies(5): >>45651578 #>>45651815 #>>45652636 #>>45653451 #>>45656541 #
11. jychang ◴[] No.45651181{3}[source]
Tensor parallelism, so you only need to store a fraction of the KV cache per GPU.
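Back-of-the-envelope, with made-up but roughly 70B-class (GQA) figures, just to show the effect; none of this is from the paper:

  def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, batch, bytes_per=2, tp=1):
      # K and V, sharded over tp GPUs along the head dimension
      per_gpu = 2 * layers * (kv_heads / tp) * head_dim * ctx_len * batch * bytes_per
      return per_gpu / 1e9

  # 80 layers, 8 KV heads, head dim 128, 16k context, batch of 4, BF16
  print(kv_cache_gb(80, 8, 128, 16_384, 4, tp=1))  # ~21 GB on a single GPU
  print(kv_cache_gb(80, 8, 128, 16_384, 4, tp=2))  # ~11 GB per GPU with TP=2
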
12. all2 ◴[] No.45651578{3}[source]
If I had to handle this problem, I'd do some kind of "split on existing loaded GPUs" for new sessions, and then when some cap is hit, spool up an additional GPU in the background and transfer the new session to that GPU as soon as the model is loaded.

I'd have to play with the configuration and load calcs, but I'm sure there's a low param, neat solution to the request/service problem.
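Something like this, as a very rough sketch (the classes and the cap are hypothetical, just to make the idea concrete):

  from dataclasses import dataclass, field

  CAP = 4  # max concurrent sessions per loaded GPU before warming another

  @dataclass
  class Gpu:
      name: str
      model_loaded: bool = False
      sessions: list = field(default_factory=list)

  def place_session(session, gpus):
      loaded = [g for g in gpus if g.model_loaded]
      target = min(loaded, key=lambda g: len(g.sessions))  # least-loaded GPU
      target.sessions.append(session)
      if all(len(g.sessions) >= CAP for g in loaded):
          spare = next((g for g in gpus if not g.model_loaded), None)
          if spare:
              spare.model_loaded = True  # stand-in for a background weight load
      return target

  gpus = [Gpu("gpu0", model_loaded=True), Gpu("gpu1")]
  for i in range(5):
      place_session(f"session-{i}", gpus)
  print([len(g.sessions) for g in gpus], gpus[1].model_loaded)  # [4, 1] True
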

13. arthurcolle ◴[] No.45651815{3}[source]
That's why DeepSeek only serves two models.
14. godelski ◴[] No.45652636{3}[source]

  > That's an eternity for a request. I highly doubt they will timeout any model they serve.
That's what easing functions are for.

Let's say 10 GPUs are in use. You keep another 3 with the model loaded. If demand increases slowly you slowly increase your headroom. If demand increases rapidly, you also increase rapidly.

The correct way to do this is more complicated and you should model it on your usage history, but if you have sufficient headroom then very few GPUs should be left idle. Remember that these models do requests in batches.

If they don't time out models, they're throwing money down the drain. Though that wouldn't be uncommon.
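A minimal sketch of that headroom idea (numbers are illustrative; a real version would, as said above, be fit to usage history):

  def target_warm_gpus(in_use: int, demand_growth_per_min: float) -> int:
      base_headroom = 3                             # e.g. 10 in use -> ~13 warm
      surge = round(2 * max(0.0, demand_growth_per_min))
      return in_use + base_headroom + surge

  print(target_warm_gpus(10, 0.2))  # slow growth: 13 GPUs kept warm
  print(target_warm_gpus(10, 4.0))  # rapid growth: 21 GPUs kept warm
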

replies(2): >>45652744 #>>45654780 #
15. jychang ◴[] No.45652744{4}[source]
That's only if you're expecting 10 GPUs in use. They're dealing with ~1 GPU in use for a model, just sitting there. Alibaba has a very long tail of old models that barely anyone uses anymore, yet they still serve them.

Here's a quote from the paper above:

> Given a list of M models to be served, our goal is to minimize the number of GPU instances N required to meet the SLOs for all models through auto-scaling, thus maximizing resource usage. The strawman strategy, i.e., no auto-scaling at all, reserves at least one dedicated instance for each model, leading to N = O(M)

For example, Qwen2 72B is rarely used these days, and yet it will take up 2 of their H20 GPUs (96GB VRAM each) to serve, at the bare minimum, assuming they don't quantize the BF16 weights down to FP8 (and I don't think they would, although other providers probably would). And then there are the other older models, like the Qwen 2.5, Qwen 2, Qwen 1.5, and Qwen 1 series. They all take up VRAM if the endpoint is active!
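The arithmetic behind "2 H20s at the bare minimum" (parameter count is approximate):

  import math

  def min_gpus(params_b: float, bytes_per_param: float, vram_gb: float = 96) -> int:
      weights_gb = params_b * bytes_per_param  # billions of params * bytes each
      return math.ceil(weights_gb / vram_gb)   # weights only, before KV cache

  print(min_gpus(72.7, 2.0))  # BF16: ~145 GB of weights -> at least 2 GPUs
  print(min_gpus(72.7, 1.0))  # FP8:  ~73 GB -> would just squeeze onto 1
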

Alibaba cannot easily just evict these models from VRAM on a timeout, even if they only get 1 request per hour.

That's the issue. Their backlog of models takes up a large amount of VRAM, yet gets ZERO compute most of the time! You can easily use an easing function to scale up from 2 GPUs to 200 GPUs, but you can never time out the last 2 GPUs serving a model.

If you read the paper linked above, it's actually quite interesting how Alibaba goes and solves this problem.

Meanwhile, DeepSeek solves the issue by just saying "fuck you, we're serving only our latest model and you can deal with it". They're pretty pragmatic about it at least.

16. ◴[] No.45653451{3}[source]
17. miki123211 ◴[] No.45653517[source]
Loading a model takes at least a few seconds, usually more, depending on model size, disk / network speed and a bunch of other factors.

If you're using an efficient inference engine like VLLM, you're adding compilation into the mix, and not all of that is fully cached yet.

If that kind of latency isn't acceptable to you, you have to keep the models loaded.

This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.
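For a concrete sense of that load latency, a quick timing sketch (assumes vLLM is installed, a GPU is available, and the weights are already on local disk; the model name is just an example):

  import time
  from vllm import LLM

  start = time.perf_counter()
  llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # weight load + engine warm-up
  print(f"cold start: {time.perf_counter() - start:.1f}s")

  start = time.perf_counter()
  llm.generate(["Hello"])                      # requests after load are fast
  print(f"first request: {time.perf_counter() - start:.1f}s")
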

replies(3): >>45654985 #>>45655187 #>>45657138 #
18. jgalt212 ◴[] No.45654780{4}[source]
The thundering herd breaks this scheme.
19. cnr ◴[] No.45654985[source]
Can you elaborate on the last statement? I don't quite understand why loading a local LLM into GPU RAM, using it for the job and then "ejecting" it is a "dumb and wasteful" idea?
replies(2): >>45655359 #>>45657343 #
20. behnamoh ◴[] No.45655187[source]
> This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.

Local models are never a dumb idea. The only time it's dumb to use them in an enterprise is if the infra is a Mac Studio with an M3 Ultra, because prompt processing (pp) time is terrible.

21. carderne ◴[] No.45655359{3}[source]
Layman understanding:

Because, as a function of hardware and electricity costs, a “cloud” GPU will be many times more efficient per output token. You aren’t loading/offloading models and don’t have any parts of the GPU waiting for input. Everything is always fully saturated.

22. tobyhinloopen ◴[] No.45656541{3}[source]
Why does it take 60 seconds to load data from RAM to VRAM? Shouldn't the PCIe bandwidth allow it to fully load in a few seconds?
replies(1): >>45659544 #
23. cnr ◴[] No.45657138[source]
Let's say, then, that it's not so much "dumb and wasteful" as "energy inefficient". In fact, it can be quite wise in a modern world full of surveillance-as-a-business and "us-east-1 disasters".
24. OJFord ◴[] No.45657343{3}[source]
I believe GP means it to still be read together with 'if this kind of latency is unacceptable to you' - i.e. you can't load/use/unload on demand, you have to keep it in RAM all the time.

In that case it's massively increasing your memory requirement: not just to the peak the model needs, but to that plus whatever the other biggest use might be that will inherently be concurrent with it.

25. throw_me_uwu ◴[] No.45659544{4}[source]
Because ML infra is bloatware beyond belief.

If it was engineered right, it would take:

- transfer model weights from NVMe drive/RAM to GPU via PCIe

- upload tiny precompiled code to GPU

- run it with tiny CPU host code

But what you get instead is gigabytes of PyTorch + Nvidia Docker container bloatware (hi Nvidia NeMo) that takes forever to start.
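For reference, the raw transfer math (theoretical PCIe rates, and assuming the weights are already in host RAM): copying even ~145 GB of BF16 weights takes a few seconds, so in practice the extra time tends to come from disk reads, deserialization, allocator/engine init and warm-up rather than the PCIe hop itself.

  def transfer_seconds(model_gb: float, bandwidth_gb_s: float) -> float:
      return model_gb / bandwidth_gb_s

  weights_gb = 72.7 * 2                     # ~145 GB, a 72B model in BF16
  print(transfer_seconds(weights_gb, 32))   # PCIe 4.0 x16, ~32 GB/s: ~4.5 s
  print(transfer_seconds(weights_gb, 64))   # PCIe 5.0 x16, ~64 GB/s: ~2.3 s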