> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found
Instead of 1,192 GPUs, they now use 213 to serve those requests.
I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model, but surely if a GPU+model sits idle for more than a couple of minutes it could be freed?
(I’m not an AI guy, though; actually, I’m used to asking SLURM for new nodes with every run I do!)
That's an eternity for a request. I highly doubt they time out any model they serve.
If it were engineered right, it would only take:
- transfer model weights from NVMe drive/RAM to GPU via PCIe
- upload tiny precompiled code to GPU
- run it with tiny CPU host code
But what you get instead is gigabytes of PyTorch plus Nvidia Docker container bloatware (hi, Nvidia NeMo) that takes forever to start.
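To make that concrete, here's a minimal sketch of the fast path the list above describes, using only the CUDA runtime and driver APIs: one bulk PCIe copy of the weights, then a tiny precompiled kernel. The file names `model.weights` and `forward.cubin`, the `forward` kernel, and the launch dimensions are all hypothetical stand-ins, and error checking is omitted for brevity.

```c
#include <cuda.h>          // driver API (cuModuleLoad, cuLaunchKernel)
#include <cuda_runtime.h>  // runtime API (cudaMalloc, cudaMemcpy)
#include <stdio.h>

int main(void) {
    cudaSetDevice(0);  // select the GPU; the runtime creates its context lazily

    // 1. Read weights from NVMe into pinned host memory, so the DMA
    //    engine can drive the PCIe link at full bandwidth.
    FILE *f = fopen("model.weights", "rb");   // hypothetical weights file
    fseek(f, 0, SEEK_END);
    size_t nbytes = (size_t)ftell(f);
    rewind(f);
    void *host_buf;
    cudaMallocHost(&host_buf, nbytes);        // pinned (page-locked) buffer
    fread(host_buf, 1, nbytes, f);
    fclose(f);

    // 2. One bulk copy over PCIe. A Gen4 x16 link moves ~25 GB/s, so
    //    even tens of gigabytes of weights land on the GPU in seconds.
    void *dev_weights;
    cudaMalloc(&dev_weights, nbytes);
    cudaMemcpy(dev_weights, host_buf, nbytes, cudaMemcpyHostToDevice);

    // 3. Load the tiny precompiled GPU binary: no JIT, no framework.
    CUmodule mod;
    CUfunction kern;
    cuModuleLoad(&mod, "forward.cubin");      // built ahead of time with nvcc
    cuModuleGetFunction(&kern, mod, "forward");

    // 4. Launch with a few lines of host code and wait for completion.
    void *args[] = { &dev_weights };
    cuLaunchKernel(kern, 1, 1, 1, 256, 1, 1, 0, NULL, args, NULL);
    cudaDeviceSynchronize();
    return 0;
}
```

Build with something like `nvcc cold_start.c -lcuda` (the driver-API calls need libcuda). The point isn't this exact code; it's that the cold-start cost is bounded by one PCIe transfer plus a module load, not by unpacking a multi-gigabyte container.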