While on that topic, why must large model inferencing be done on a single large GPU and/or bank of memory rather than a cluster of them? Is there promise of being able to eventually run large models on clusters of weaker GPUs?
https://github.com/exo-explore/exo
It's a project designed to run large models in a distributed manner. My need for GPUs is for my own machine learning research pet project (mostly evolutionary neural network models for now), which is a bit different from inference needs. Training is yet another story.
But yeah, I agree. I think machine learning should become more distributed in the future.
According to the author of Exo (https://blog.exolabs.net/day-1/):
> When Shard A finishes processing its layers, it produces an activation that gets passed to Shard B over whatever network connection is available. In general these activations are actually quite small - for Llama 3.2 3B they are less than 4KB. They scale approximately linearly with the size of the layers. Therefore the bottleneck here is generally the latency between devices, not the bandwidth (a common misconception).
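To make the shard-passing concrete, here's a minimal NumPy sketch of pipeline-parallel inference across two shards. The `Shard` class, the layer count, and the `tanh` stand-in for a transformer block are all illustrative assumptions of mine, not exo's actual implementation; the hidden dimension of 3072 is what I believe Llama 3.2 3B uses:

```python
import numpy as np

HIDDEN = 3072  # hidden dimension; assumed value for Llama 3.2 3B

class Shard:
    """One device's slice of the layer stack (illustrative, not exo's API)."""
    def __init__(self, n_layers: int):
        # The parameters live on the shard: one HIDDEN x HIDDEN matrix per layer.
        self.layers = [(np.random.randn(HIDDEN, HIDDEN) * 0.01).astype(np.float16)
                       for _ in range(n_layers)]

    def forward(self, x: np.ndarray) -> np.ndarray:
        for w in self.layers:
            x = np.tanh(x @ w)  # stand-in for a full transformer block
        return x

shard_a = Shard(n_layers=2)  # would run on device A
shard_b = Shard(n_layers=2)  # would run on device B

x = np.random.randn(1, HIDDEN).astype(np.float16)  # one token's hidden state
activation = shard_a.forward(x)  # computed entirely on device A
# Only this vector crosses the network between devices:
print(activation.nbytes, "bytes per token between shards")  # 6144 at fp16
output = shard_b.forward(activation)  # computed entirely on device B
```

The point is that each shard holds all of its layers' weights locally, but per token only a single hidden-state vector has to travel between devices, which is why latency rather than bandwidth is the bottleneck.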
I think the quoted claim makes sense: the activations are just the numbers coming out of the network (or a slice of it) for the current token, so their size is set by the hidden dimension, not by the parameter count. Compared to the number of parameters, they're orders of magnitude smaller.
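A rough back-of-the-envelope check (my numbers, assuming fp16 and a hidden dimension of 3072 for Llama 3.2 3B; the blog's "less than 4KB" figure presumably reflects a lower-precision activation):

```python
# Rough size comparison for Llama 3.2 3B, assuming fp16 throughout.
hidden_dim = 3072                 # per-token activation is one such vector
n_params = 3_000_000_000          # ~3B parameters

activation_bytes = hidden_dim * 2   # 2 bytes per fp16 value
param_bytes = n_params * 2

print(f"activation per token: {activation_bytes / 1024:.1f} KB")        # 6.0 KB
print(f"all parameters:       {param_bytes / 2**30:.1f} GB")            # ~5.6 GB
print(f"ratio:                ~{param_bytes // activation_bytes:,}:1")  # ~10^6:1
```

So the data that has to cross the network per token is roughly a million times smaller than the weights that stay put on each device.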