While on that topic, why must large model inferencing be done on a single large GPU and/or bank of memory rather than a cluster of them? Is there promise of being able to eventually run large models on clusters of weaker GPUs?
https://github.com/exo-explore/exo
It's a project designed to run a large model in a distributed manner. My need for a GPU is to run my own machine learning research pet project (mostly evolutionary neural network models for now), which is a bit different from inference needs. Training is yet another story.
But yeah, agreed. I think machine learning should become more distributed in the future.
According to the author of Exo https://blog.exolabs.net/day-1/:
> When Shard A finishes processing its layers, it produces an activation that gets passed to Shard B over whatever network connection is available. In general these activations are actually quite small - for Llama 3.2 3B they are less than 4KB. They scale approximately linearly with the size of the layers. Therefore the bottleneck here is generally the latency between devices, not the bandwidth (a common misconception).
I think that makes sense, because the activations are just the numbers coming out of the whole neural network (or part of it). Compared to the number of parameters, they're not on the same order of magnitude.
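As a rough back-of-the-envelope check (a sketch with assumed numbers; the hidden size, dtype, and parameter count below are illustrative, not taken from the Exo post):

    # Compare what crosses the network between shards (one token's activation)
    # with what has to sit in GPU memory (the weights). Numbers are assumed.
    hidden_size = 2048       # width of one token's activation vector (model-dependent)
    bytes_per_value = 2      # fp16 / bf16
    n_params = 3e9           # ~3B-parameter model

    activation_bytes = hidden_size * bytes_per_value   # sent between shards per token
    weight_bytes = n_params * bytes_per_value           # held in memory, never sent per token

    print(f"activation per token: {activation_bytes / 1024:.1f} KiB")   # a few KiB
    print(f"weights: {weight_bytes / 1024**3:.1f} GiB")                  # several GiB

So per token the payload between devices is a few KiB against several GiB of weights that never move, which is why the latency between shards, not the bandwidth, ends up being the bottleneck.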
For now I’m just using Docker’s Nvidia container runtime for containers that need GPU acceleration.
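If it's useful as a reference point, that kind of setup is roughly the following (a sketch, assuming the NVIDIA Container Toolkit is already installed and registered as a Docker runtime; the image tag is just an example):

    # /etc/docker/daemon.json after the NVIDIA Container Toolkit registers its runtime:
    #   "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } }

    # Smoke test: the toolkit mounts the driver libraries and nvidia-smi into the container
    docker run --rm --gpus all ubuntu:22.04 nvidia-smi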
Will likely spend more time digging into your findings — hoping it results in me finding a solution to my setup!
https://github.com/NVIDIA/k8s-device-plugin/issues/1182
And I opened a PR to fix it here:
https://github.com/NVIDIA/k8s-device-plugin/pull/1183
I am unsure whether this bug is specific to the NixOS environment, since its library paths and other quirks differ from those of major Linux distros.
Another major problem was that the "default_runtime_name" in the Containerd config didn't work as expected. I had to create a RuntimeClass and assign it to the pod to make it pick up the Nvidia runtime.
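For anyone hitting the same thing, a minimal sketch of that workaround (the `nvidia` names and the image are assumptions; the handler has to match the runtime name registered in your containerd config):

    # RuntimeClass pointing at the nvidia runtime registered with containerd
    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: nvidia
    handler: nvidia   # must match [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    ---
    # Pod opting into that runtime explicitly instead of relying on default_runtime_name
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test
    spec:
      runtimeClassName: nvidia
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1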
Other than that, I haven't tried K3s; what I'm running is a full-blown K8s cluster. I'd guess they should be similar.
While there's no guarantee, if you find any hints showing why your Nvidia plugin won't work, I might be able to help, as I skipped over some minor issues I encountered in the articles. If they happen to be the ones I faced, I can share how I solved them.
https://gist.github.com/fangpenlin/1cc6e80b4a03f07b79412366b...
But later on, since I am taking the CDI route, it appears that libnvidia-container (nvidia-container-cli) is not really used. If you go with the plain container-runtime approach instead of CDI, you may need a patch like this for the libnvidia-container package.
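For context, "the CDI route" means the GPUs are described by a generated spec and injected by a CDI-aware runtime, so nvidia-container-cli never runs. Roughly (a sketch; on NixOS the spec generation may be handled by the system config rather than run by hand):

    # Generate a CDI spec describing the GPUs on this host
    nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    # List the device names the spec exposes (e.g. nvidia.com/gpu=0, nvidia.com/gpu=all)
    nvidia-ctk cdi list

    # A CDI-aware runtime injects the devices straight from that spec,
    # so libnvidia-container / nvidia-container-cli is not in the path.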
The fact that you don't see the problem immediately also makes me wonder whether it only renders that way on mobile, or in my browser/OS combo (Cromite on Android), or maybe it's the result of some resource (JS/CSS/WOFF) getting blocked by the built-in adblocker or something.