While on that topic, why must large model inferencing be done on a single large GPU and/or bank of memory rather than a cluster of them? Is there promise of being able to eventually run large models on clusters of weaker GPUs?
According to the author of Exo https://blog.exolabs.net/day-1/:
> When Shard A finishes processing its layers, it produces an activation that gets passed to Shard B over whatever network connection is available. In general these activations are actually quite small - for Llama 3.2 3B they are less than 4KB. They scale approximately linearly with the size of the layers. Therefore the bottleneck here is generally the latency between devices, not the bandwidth (a common misconception).
I think that makes sense: the activations are just the values coming out of the neural network (or a slice of it), so per token they're on the order of the hidden dimension, nowhere near the magnitude of the parameter count.
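To make that concrete, here's a back-of-the-envelope sketch in Python. The hidden dimension, parameter count, and element sizes below are my assumptions for Llama 3.2 3B, not figures from the Exo post; the point is just the ratio between what crosses the network per token and what stays resident on each shard.

```python
# Back-of-the-envelope: per-token activation payload vs. resident weights.
# Assumed figures (not from the Exo post): hidden dim ~3072, ~3.2B params.

hidden_dim = 3072          # assumed model width for Llama 3.2 3B
n_params = 3.2e9           # assumed parameter count

bytes_per_activation = 1   # e.g. activations quantized to 8-bit for transfer
bytes_per_weight = 2       # e.g. fp16 weights

# What shard A actually sends to shard B per token: one hidden-state vector.
activation_payload = hidden_dim * bytes_per_activation

# What never crosses the network: the weights, split across the shards.
weight_footprint = n_params * bytes_per_weight

print(f"activation handed between shards: {activation_payload / 1024:.1f} KB")
print(f"weights resident across the cluster: {weight_footprint / 1e9:.1f} GB")
# ~3 KB moves per token per hop; ~6.4 GB never moves at all. With payloads
# that small, round-trip latency between devices dominates, not bandwidth.
```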