I had imagined that the large GPU clusters dynamically allocate whole machines to different tasks depending on load.
So, hypothetically, if ChatGPT's peak load were 3× its minimum load, they could reallocate up to 2/3 of their serving fleet to training during off-peak hours.
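To make that arithmetic concrete, here's a minimal sketch (the fleet size, ratio, and function name are made-up for illustration):

```python
import math

# Hypothetical capacity arithmetic: at trough, serving needs only
# total / peak_to_trough machines; the remainder could be loaned to training.
def servers_freed_for_training(total_servers: int, peak_to_trough: float) -> int:
    serving_needed = math.ceil(total_servers / peak_to_trough)
    return total_servers - serving_needed

print(servers_freed_for_training(900, 3.0))  # 600, i.e. 2/3 of the fleet
```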
Doing the same thing inside an individual GPU seems irrelevant to anyone operating at scale, since they can approximate the same behavior with entire servers or even entire racks.