Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

(cerebras.ai)

427 points benchmarkist | 5 comments | 19 Nov 24 00:15 UTC

zackangelo ◴[19 Nov 24 02:11 UTC] No.42179476[source]▶

This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #

1. mikewarot ◴[19 Nov 24 06:32 UTC] No.42180569[source]▶

>>42179476 #

Imagine if you could take Llama 3.1 405B and break it down into a tree of logic gates, optimizing out things like multiplies by 0 in one of the bits, etc. Then load it into a massive FPGA-like chip with no von Neumann bottleneck: just pure compute, without memory access latency, at a conservative 1 GHz clock rate.
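
To make the "multiplies by 0 in one of the bits" point concrete, here is a minimal sketch, assuming non-negative integer (fixed-point) weights that are known ahead of time. With a constant weight, a multiply reduces to one shifted add per set bit, and every zero bit drops out entirely. The helper names `shift_add_terms` and `multiply_fixed` are made up for illustration, and this is not how any particular synthesis tool does it.

```python
def shift_add_terms(weight: int) -> list[int]:
    """Return the bit positions that are set in a non-negative constant weight.

    Each set bit costs one shifted add in hardware; zero bits cost nothing.
    """
    if weight < 0:
        raise ValueError("this sketch only handles non-negative weights")
    return [bit for bit in range(weight.bit_length()) if (weight >> bit) & 1]


def multiply_fixed(x: int, weight: int) -> int:
    """Multiply x by a constant weight using only shifts and adds."""
    return sum(x << bit for bit in shift_add_terms(weight))


# Example: weight 10 is 0b1010, so only two of its four bits need an adder.
terms = shift_add_terms(10)
product = multiply_fixed(7, 10)
```

A weight of 0 yields no terms at all, which is the case where the whole multiplier can be optimized away.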

Such a system would be limited by the latency across the reported 126 layers' worth of math involved before it could generate the next token, which might be as much as 100 µs. That works out to about 10,000 tokens/s for a single stream, roughly 10x faster than the 969 tokens/s reported here, but you could have thousands of other independent streams pipelined through in parallel because you'd get a token per clock cycle out the end.

In summary, 1 Gigatoken/second, divided into 100,000 separate users each getting 10k tokens/second.
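
The arithmetic behind those figures can be checked with a short sketch. It takes the 1 GHz clock and 100 µs latency from the comment as given and assumes an idealized pipeline: one token leaves per clock cycle once the pipeline is full, and each stream has to wait the full latency before its next token. The function name `pipeline_estimate` is hypothetical.

```python
def pipeline_estimate(clock_mhz: int, latency_us: int) -> dict[str, int]:
    """Estimate throughput of an idealized, fully pipelined design.

    clock_mhz:  clock rate in MHz (1000 for the 1 GHz in the comment)
    latency_us: end-to-end latency for one token in microseconds
    """
    # One token per clock cycle leaves the pipeline, across all streams.
    aggregate_tokens_per_s = clock_mhz * 1_000_000
    # A single stream gets one token per full pass through the pipeline.
    per_stream_tokens_per_s = 1_000_000 // latency_us
    # Cycles in flight equals the number of independent streams that fit.
    concurrent_streams = clock_mhz * latency_us
    return {
        "aggregate_tokens_per_s": aggregate_tokens_per_s,
        "per_stream_tokens_per_s": per_stream_tokens_per_s,
        "concurrent_streams": concurrent_streams,
    }


estimate = pipeline_estimate(clock_mhz=1000, latency_us=100)
# Speedup of one stream over the 969 tokens/s in the headline.
speedup_vs_headline = estimate["per_stream_tokens_per_s"] / 969
```

Under these assumptions the three numbers are tied together: aggregate throughput equals concurrent streams times per-stream throughput.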

This is the future I want to build.

replies(2): >>42181807 #>>42181968 #

2. seangrogg ◴[19 Nov 24 10:10 UTC] No.42181807[source]▶

>>42180569 (TP) #

I'm actively trying to learn how to do exactly this, though I'm just getting started with FPGAs now, so it's probably a very long-range goal.

replies(1): >>42191030 #

3. jacobgorm ◴[19 Nov 24 10:36 UTC] No.42181968[source]▶

>>42180569 (TP) #

See Convolutional Differentiable Logic Gate Networks https://arxiv.org/abs/2411.04732 , which is a small step in that direction.

4. ryao ◴[20 Nov 24 05:21 UTC] No.42191030[source]▶

>>42181807 #

There is not enough memory attached to FPGAs to do this. Some FPGAs come with 16GB of HBM attached, but that is not enough and the bandwidth provided is not as high as it is on GPUs. You would need to work out how to connect enough memory chips simultaneously to get high bandwidth and enough capacity in order for an FPGA solution to be performance competitive with a GPU solution.
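
A rough capacity check shows the scale of the gap. This sketch counts weights only (no KV cache or activations), takes the 405B parameter count and the 16GB HBM figure from this thread, and treats the bytes per parameter as an assumption (2 for 16-bit weights, 1 for 8-bit). The function name `devices_for_weights` is hypothetical.

```python
def devices_for_weights(params: int, bytes_per_param: int, hbm_bytes: int) -> int:
    """Minimum number of devices needed just to hold the model weights."""
    weight_bytes = params * bytes_per_param
    # Ceiling division without floating point.
    return -(-weight_bytes // hbm_bytes)


PARAMS = 405_000_000_000   # Llama 3.1 405B
HBM_BYTES = 16_000_000_000  # 16GB per FPGA, using 1 GB = 10**9 bytes

devices_16bit = devices_for_weights(PARAMS, 2, HBM_BYTES)  # 810 GB of weights
devices_8bit = devices_for_weights(PARAMS, 1, HBM_BYTES)   # 405 GB of weights
```

Even with 8-bit weights, the model would have to be spread across dozens of such devices before bandwidth is considered at all.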

replies(1): >>42191545 #

5. mikewarot ◴[20 Nov 24 07:29 UTC] No.42191545{3}[source]▶

>>42191030 #

Instead of separate memory/compute, I propose to fuse them.
