I had imagined that the large GPU clusters dynamically allocate whole machines to different tasks depending on load.
So, hypothetically, if ChatGPT's peak load were 3× its minimum load, they could reallocate up to 2/3 of their serving fleet to training during off-peak hours.
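To make that arithmetic concrete, here's a minimal sketch (the fleet size, ratio, and function name are made-up for illustration):

```python
import math

# Hypothetical capacity arithmetic: at trough, serving needs only
# total / peak_to_trough machines; the remainder could be loaned to training.
def servers_freed_for_training(total_servers: int, peak_to_trough: float) -> int:
    serving_needed = math.ceil(total_servers / peak_to_trough)
    return total_servers - serving_needed

print(servers_freed_for_training(900, 3.0))  # 600, i.e. 2/3 of the fleet
```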
Doing the same thing inside an individual GPU seems irrelevant to anyone operating at scale, since they can approximate the same behavior with entire servers or even entire racks.