colordrops:
This looks fun. The author mentions machine learning workloads. What are typical machine learning use cases for a cluster of lower-end GPUs?

While on that topic, why must large-model inference be done on a single large GPU and/or bank of memory rather than a cluster of them? Is there any promise of eventually being able to run large models on clusters of weaker GPUs?

thangngoc89:
The bottleneck in distributed GPU training/inference is the inter-GPU connection speed. Within a single node it's doable because the GPUs communicate over PCIe 4.0. Across a cluster you need at least a 50 Gbps link between nodes, which is expensive relative to cheap GPUs.
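
A rough back-of-envelope for the training side, assuming plain data-parallel training with a full fp16 gradient all-reduce every step (hypothetical numbers; real setups overlap communication with compute and often compress gradients):

    # Back-of-envelope: gradient sync cost for data-parallel training.
    n_params = 3e9                                # assume a ~3B-parameter model
    bytes_per_param = 2                           # fp16 gradients
    grad_bytes = n_params * bytes_per_param       # ~6 GB to exchange per step
    link_gbps = 50                                # the 50 Gbps inter-node link mentioned above
    link_bytes_per_s = link_gbps / 8 * 1e9        # ~6.25 GB/s
    sync_seconds = grad_bytes / link_bytes_per_s  # ~1 s of pure communication per step
    print(f"{grad_bytes / 1e9:.1f} GB per sync, ~{sync_seconds:.1f} s at {link_gbps} Gbps")

Even ignoring all-reduce overhead, that is on the order of a second of communication per optimizer step, which is why training cares about link speed far more than inference does.
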
fangpenlin:
For training, yes, you need to share the parameters (i.e., weights and biases), and there are a huge number of them. But for inference, you don't need nearly as much bandwidth to run in a distributed manner.

According to the author of Exo (https://blog.exolabs.net/day-1/):

> When Shard A finishes processing its layers, it produces an activation that gets passed to Shard B over whatever network connection is available. In general these activations are actually quite small - for Llama 3.2 3B they are less than 4KB. They scale approximately linearly with the size of the layers. Therefore the bottleneck here is generally the latency between devices, not the bandwidth (a common misconception).
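
A minimal sketch of that handoff (not Exo's actual code; toy shards, NumPy only, one decode step, and a hypothetical hidden size of 2048), just to show that the only thing crossing the network is a single hidden-state vector per token:

    # Toy pipeline-parallel inference: each shard owns a slice of the layers
    # and forwards only the activation (one hidden-state vector) to the next shard.
    import numpy as np

    HIDDEN = 2048  # hypothetical hidden size

    class Shard:
        def __init__(self, n_layers: int, rng: np.random.Generator):
            # Stand-in for "this node's slice of transformer layers".
            self.weights = [(rng.standard_normal((HIDDEN, HIDDEN)) * 0.01).astype(np.float16)
                            for _ in range(n_layers)]

        def forward(self, activation: np.ndarray) -> np.ndarray:
            for w in self.weights:
                activation = np.tanh(activation @ w)  # toy "layer"
            return activation

    rng = np.random.default_rng(0)
    shard_a, shard_b = Shard(14, rng), Shard(14, rng)

    hidden = rng.standard_normal(HIDDEN).astype(np.float16)  # embedding of the current token
    hidden = shard_a.forward(hidden)                     # runs entirely on node A
    payload = hidden.tobytes()                           # this is all that crosses the network
    print(f"activation payload: {len(payload)} bytes")   # 2048 * 2 bytes = 4096 bytes
    hidden = shard_b.forward(np.frombuffer(payload, dtype=np.float16))  # node B continues

With these made-up dimensions the payload is about 4 KB per token, the same ballpark as the figure quoted above; the weights themselves never move between nodes.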

I think that makes sense: the activations are just the numbers coming out of the neural network (or a part of it). Compared to the number of parameters, they're nowhere near the same order of magnitude.
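
Putting rough numbers on that gap (same hypothetical figures as the sketch above, fp16 for both):

    # Parameters vs. per-token activation, order-of-magnitude only.
    n_params = 3e9                               # ~3B-parameter model
    hidden_dim = 2048                            # hypothetical hidden size
    bytes_fp16 = 2
    param_bytes = n_params * bytes_fp16          # ~6 GB: what training has to keep in sync
    activation_bytes = hidden_dim * bytes_fp16   # ~4 KB: what inference hands to the next shard per token
    print(f"~{param_bytes / activation_bytes:,.0f}x gap")  # about six orders of magnitude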