←back to thread

521 points hd4 | 3 comments | | HN request time: 0.513s | source
Show context
kilotaras ◴[] No.45644776[source]
Alibaba Cloud claims to reduce Nvidia GPU used for serving unpopular models by 82% (emphasis mine)

> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found

Instead of 1192 GPUs they now use 213 for serving those requests.

replies(5): >>45645037 #>>45647752 #>>45647863 #>>45651559 #>>45653363 #
bee_rider ◴[] No.45647863[source]
I’m slightly confuse as to how all this works. Do the GPUs just sit there with the models on them when the models are not in use?

I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?

(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)

replies(6): >>45648058 #>>45648291 #>>45648653 #>>45649219 #>>45650208 #>>45653517 #
miki123211 ◴[] No.45653517[source]
Loading a model takes at least a few seconds, usually more, depending on model size, disk / network speed and a bunch of other factors.

If you're using an efficient inference engine like VLLM, you're adding compilation into the mix, and not all of that is fully cached yet.

If that kind of latency isn't acceptable to you, you have to keep the models loaded.

This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.

replies(3): >>45654985 #>>45655187 #>>45657138 #
1. cnr ◴[] No.45654985[source]
Can you elaborate the last statement? Don't quite understand why loading local LLM to GPU RAM, using it for the job and then "ejecting" is "dumb and wasteful" idea?
replies(2): >>45655359 #>>45657343 #
2. carderne ◴[] No.45655359[source]
Layman understanding:

Because as a function of hardware and electricity costs, a “cloud” GPU will be many times more efficient per output token. You aren’t loading/offloading models and don’t have any parts of the GPU waiting for input. Everything is fully saturated always.

3. OJFord ◴[] No.45657343[source]
I believe GP means it still to be connected to 'if this kind of latency is unacceptable to you' - i.e. you can't load/use/unload, you have to keep it in RAM all the time.

In that case it's massively increasing your memory requirement not just to the peak the model needs, but to + whatever the other biggest use might be that'll be inherently concurrent with it.