
69 points by Jrxing | 2 comments
1. jewel No.45661100
I had imagined that large GPU clusters dynamically allocate whole machines to different tasks depending on load.

So, hypothetically, if ChatGPT's peak load were 3× their minimum load, they could reallocate up to 2/3 of their servers to training off-peak.

Doing the same thing inside an individual GPU seems irrelevant to anyone operating at scale, since they can approximate the same behavior with entire servers or even entire racks.
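
Back-of-the-envelope sketch of that reallocation (Python; the fleet size, load numbers, and headroom margin are made-up assumptions, just to illustrate the ratio):

    import math

    def servers_free_for_training(total_servers, current_load, peak_load, headroom=0.1):
        """Servers that can be lent to training while keeping a safety margin for serving."""
        # Fraction of peak capacity needed right now, plus headroom, capped at the full fleet.
        needed_fraction = min(1.0, current_load / peak_load + headroom)
        needed_servers = math.ceil(total_servers * needed_fraction)
        return total_servers - needed_servers

    # With a 3x peak-to-trough ratio and no headroom, ~2/3 of the fleet is idle at the trough:
    print(servers_free_for_training(900, current_load=1.0, peak_load=3.0, headroom=0.0))  # 600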

replies(1): >>45664089 #
2. Jrxing No.45664089
Sharing the big GPU cluster with non-latency-critical load is one solution we also explored.

For this work, we are focusing more on the problem of smaller models running on SOTA GPUs. Distilled/fine-tuned small models have shown comparable performance on vertical tasks.
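
A rough illustration of why a small distilled model underfills a SOTA GPU (Python; the model sizes, fp16 precision, and the 80 GB / 10 GB memory figures are assumptions for the example, not measurements from this work):

    def replicas_per_gpu(params_billions, bytes_per_param=2, gpu_mem_gb=80, overhead_gb=10):
        """How many copies of a small model fit in one GPU's memory (ignoring compute)."""
        weights_gb = params_billions * bytes_per_param  # fp16/bf16 weights only, no KV cache
        return int((gpu_mem_gb - overhead_gb) / weights_gb)

    # A distilled 7B model needs ~14 GB of weights, so one replica leaves most of an
    # 80 GB GPU idle; several replicas (or unrelated jobs) could share the same card.
    print(replicas_per_gpu(7))  # 5
    print(replicas_per_gpu(1))  # 35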