
521 points | hd4 | 3 comments
kilotaras ◴[] No.45644776[source]
Alibaba Cloud claims to reduce the number of Nvidia GPUs used for serving unpopular models by 82% (emphasis mine):

> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found

Instead of 1,192 GPUs, they now use 213 to serve those requests.
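
As a sanity check on the 82% figure, a minimal sketch (assuming the 1,192 and 213 counts are directly comparable; Python used purely for the arithmetic):

    # Back-of-the-envelope check of the claimed 82% reduction
    before, after = 1192, 213
    print(f"{1 - after / before:.1%}")  # -> 82.1%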

replies(5): >>45645037 #>>45647752 #>>45647863 #>>45651559 #>>45653363 #
1. yorwba ◴[] No.45645037[source]
Not really. Figure 1(a) of the paper says that the 17.7% is relative to a total of 30k GPUs (i.e. 5,310 GPUs handling those 1.35% of requests), and the reduction is measured in a smaller beta deployment serving only 47 different models (vs. the 733 "cold" models overall). Naïve extrapolation by model count suggests they would need about 3,321 GPUs to serve all cold models, a 37.5% reduction compared to before (or a 6.6% reduction of the full 30k-GPU cluster).
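
Rough arithmetic behind that extrapolation, a sketch assuming GPU demand scales linearly with the number of models served:

    # Extrapolate the 47-model beta deployment (213 GPUs) to all 733 cold models
    beta_gpus, beta_models, cold_models = 213, 47, 733
    est_gpus = beta_gpus * cold_models / beta_models    # ~3321 GPUs
    baseline = 0.177 * 30_000                           # ~5310 GPUs (17.7% of 30k)
    print(1 - est_gpus / baseline)                      # ~0.374, i.e. the ~37.5% reduction
    print((baseline - est_gpus) / 30_000)               # ~0.066, i.e. ~6.6% of the full cluster
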
replies(1): >>45646860 #
2. somerandomdude2 ◴[] No.45646860[source]
Really:

"A paper presented at SOSP 2025 details how token-level scheduling helped one GPU serve multiple LLMs, reducing demand from 1,192 to 213 H20s."

Which, if you scale it, matches the GP's statement.

replies(1): >>45648333 #
3. yorwba ◴[] No.45648333[source]
From the SCMP article you might get the impression that the various figures all refer to the same GPU cluster, but the paper itself makes clear that this is not the case, i.e. the 213 GPUs in the smaller cluster are not serving 1.35% of the requests in the larger cluster. If you then want to scale it, you have a choice of which number to scale, and each choice gives a different result. Since they're constrained by the limited number of different models a single GPU can serve, I think scaling by the number of models is the most realistic option.
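
To make that concrete, a sketch of two naive scalings (neither is a figure the paper itself reports; the numbers come from the thread above):

    # Two naive ways to extrapolate the 213-GPU beta figure to the full workload
    beta_gpus, beta_models = 213, 47
    cold_models = 733
    cold_gpus_today = 0.177 * 30_000                    # ~5310 GPUs

    # (a) scale by model count (the binding constraint is models per GPU)
    by_models = beta_gpus * cold_models / beta_models   # ~3322 GPUs, ~37% fewer

    # (b) apply the 1192 -> 213 ratio directly to today's ~5310 GPUs
    by_ratio = cold_gpus_today * 213 / 1192             # ~949 GPUs, ~82% fewer

    print(round(by_models), round(by_ratio))            # very different answers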