Improving flops is the most obvious way to improve speed, but I think we're pretty close to physical limits for a given process node and datatype precision. It's hard to give proof positive of this, but there are a few lines of evidence. One is that the fundamental operation of LLMs, the matrix multiplication, is really simple (unlike, say, general CPU work), so overhead like control-flow logic is already minimized. We're largely spending electricity on the matrix multiplications themselves, and matrix multiplications are in fact electricity-bound[1]. There are gains to be made by lowering precision, but this is difficult and in my opinion we're close to tapped out: we're already at very low precisions (fp8 can't even represent 17 exactly), and new research is showing the limits of going lower.
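As a quick illustration of how coarse fp8 already is, here's a minimal sketch (assuming a recent PyTorch build with float8 support, i.e. `torch.float8_e4m3fn`) showing 17 getting rounded to a neighboring representable value:

```python
import torch

# E4M3 has a 3-bit mantissa; between 16 and 32 it can only step in
# increments of 2, so 17 is not representable and gets rounded to a
# neighboring value (16 or 18, depending on the rounding mode).
x = torch.tensor(17.0, dtype=torch.float32)
x_fp8 = x.to(torch.float8_e4m3fn)
print(x_fp8.to(torch.float32))  # 16.0 (or 18.0), not 17.0
```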
Efficiency in LLM training is measured against a very punishing standard, "Model Flops Utilization" (MFU): the number of flops mathematically necessary to implement the operation, divided by the theoretical peak flops the hardware could provide. We're able to get ~30% without much thought (just FSDP), and 50-60% is not implausible or unheard of. The inefficiency is largely because 1) the hardware can't provide the number of flops it says on the tin, for various reasons, and 2) we have to synchronize terabytes of data across tens of thousands of machines. The theoretical limit here is a 2x improvement (from ~50% MFU to 100%), but in practice there's not a ton left to eke out.
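A rough back-of-the-envelope MFU calculation, as a sketch. The 6·N·D approximation for training flops per token is standard, but the model size, throughput, and per-GPU peak numbers below are illustrative assumptions, not measurements:

```python
def mfu(params: float, tokens_per_second: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model Flops Utilization: flops the model mathematically needs per
    second, divided by the theoretical peak flops of the hardware."""
    # Standard approximation: training costs ~6 flops per parameter per
    # token (forward + backward pass).
    model_flops_per_second = 6 * params * tokens_per_second
    peak_flops = n_gpus * peak_flops_per_gpu
    return model_flops_per_second / peak_flops

# Illustrative numbers: a 70B-parameter model training at 1M tokens/s
# on 1,024 GPUs rated at ~1e15 bf16 flops each.
print(f"MFU ~ {mfu(70e9, 1.0e6, 1024, 1e15):.0%}")  # ~41%
```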
There will be gains, but they'll mostly come from reducing NVIDIA's margin (TPUs), improving the process node, reducing datatype precision (B100), or enlarging the chip itself to reduce costly cross-chip communication (B100). There's no room for a 10x (again, at constant precision and process node).
[1]: https://www.thonking.ai/p/strangely-matrix-multiplications