> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found
Instead of 1192 GPUs they now use 213 for serving those requests.
> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found
Instead of 1192 GPUs they now use 213 for serving those requests.
I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?
(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)
That's an eternity for a request. I highly doubt they will timeout any model they serve.
I'd have to play with the configuration and load calcs, but I'm sure there's a low param, neat solution to the request/service problem.
  > That's an eternity for a request. I highly doubt they will timeout any model they serve.
Let's say 10 GPUs are in use. You keep another 3 with the model loaded. If demand increases slowly you slowly increase your headroom. If demand increases rapidly, you also increase rapidly.
The correct way to do this is more complicated and you should model based on your usage history, but if you have sufficient headroom then very few should be left idle. Remember that these models do requests in batches.
If they don't timeout models, they're throwing money down the drain. Though that wouldn't be uncommon.
Here's a quote from the paper above:
> Given a list of M models to be served, our goal is to minimize the number of GPU instances N required to meet the SLOs for all models through auto-scaling, thus maximizing resource usage. The strawman strategy, i.e., no auto-scaling at all, reserves at least one dedicated instance for each model, leading to N = O(M)
For example, Qwen2 72b is rarely used these days. And yet it will take up 2 of their H20 gpus (with 96GB VRAM) to serve, at the bare minimum, assuming that they don't quantize the BF16 down to FP8 (and I don't think they would, although other providers probably would). And then there's other older models, like the Qwen 2.5, Qwen 2, Qwen 1.5, and Qwen 1 series models. They all take up VRAM if the endpoint is active!
Alibaba cannot easily just timeout these models from VRAM, even if they only get 1 request per hour.
That's the issue. Their backlog of models take up a large amount of VRAM, and yet get ZERO compute most of the time! You can easily use an easing function to scale up from 2 gpus to 200 gpus, but you cannot ever timeout the last 2 gpus that's serving the model.
If you read the paper linked above, it's actually quite interesting how Alibaba goes and solves this problem.
Meanwhile on the other hand, Deepseek solves the issue by just saying "fuck you, we're serving only our latest model and you can deal with it". They're pretty pragmatic about it at least.
If it was engineered right, it would take:
- transfer model weights from NVMe drive/RAM to GPU via PCIe
- upload tiny precompiled code to GPU
- run it with tiny CPU host code
But what you get instead is gigabytes of PyTorch + Nvidia docker container bloatware (hi Nvidia NeMo) that takes forever to start.