
426 points by benchmarkist | 5 comments
zackangelo No.42179476
This is astonishingly fast. I’m struggling to get over 100 tok/s with my own Llama 3.1 70B implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, FlashAttention) won’t get you close. It seems like, at a minimum, you’d need multi-node inference and maybe some kind of sparse attention mechanism?
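For context, a rough roofline sketch of why ~100 tok/s is in the right ballpark for single-stream decode: at fp16, every weight has to stream out of HBM once per generated token, so aggregate memory bandwidth caps throughput. The figures below (H100 HBM bandwidth, fp16 weights) are illustrative assumptions, not numbers from the thread:

```python
# Back-of-envelope ceiling for memory-bandwidth-bound decoding of a 70B
# model on 8x H100. All numbers are rough assumptions for illustration.
params = 70e9                  # Llama 3.1 70B parameters
bytes_per_param = 2            # fp16
weight_bytes = params * bytes_per_param      # ~140 GB streamed per token

hbm_bw = 3.35e12               # bytes/s per H100 (SXM, HBM3, approx.)
gpus = 8

# Single-stream decode reads every weight once per token, so:
ceiling = (hbm_bw * gpus) / weight_bytes
print(f"~{ceiling:.0f} tok/s upper bound")   # ~191 tok/s, before KV cache,
                                             # comms, and kernel overheads
```

Against a ~190 tok/s bandwidth ceiling, 100 tok/s is already respectable, which is why results far beyond that imply something other than the standard tricks.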

1. mikewarot No.42180569
Imagine if you could take Llama 3.1 405B and break it down into a tree of logic gates, optimizing away things like multiplies by a 0 bit, then load it into a massive FPGA-like chip with no von Neumann bottleneck: pure compute with no memory-access latency, at a conservative 1 GHz clock rate.

Such a system would be limited by the latency through the reported 126 layers' worth of math before it could emit the next token, which might be as much as 100 µs. That's already roughly 10x faster per stream, but because a token comes out the end every clock cycle, you could also pipeline thousands of independent streams through in parallel.

In summary: 1 gigatoken/second, divided among 100,000 separate users, each getting 10k tokens/second.
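Sanity-checking those numbers (my arithmetic, using only the figures stated above):

```python
# Checking the claim: 1 GHz clock, ~100 us of latency through 126 layers,
# one token emitted per clock cycle once the pipeline is full.
clock_hz = 1e9
latency_s = 100e-6                        # per-token traversal time
pipeline_depth = clock_hz * latency_s     # ~100,000 tokens in flight

aggregate = clock_hz                      # 1 token/cycle = 1 Gtoken/s
users = 100_000                           # independent streams interleaved
per_user = aggregate / users              # round-robin share per stream

print(f"{pipeline_depth:,.0f} pipeline slots in flight")
print(f"{aggregate/1e9:.0f} Gtoken/s total, {per_user:,.0f} tok/s per user")
```

The 10k tok/s per user is just the inverse of the 100 µs latency: with streams interleaved round-robin, each one gets a token back per full traversal of the pipe.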

This is the future I want to build.

2. seangrogg No.42181807
I'm actively trying to learn how to do exactly this, though I'm only just getting started with FPGAs, so it's probably a very long-range goal.
3. jacobgorm No.42181968
See "Convolutional Differentiable Logic Gate Networks" (https://arxiv.org/abs/2411.04732), which is a small step in that direction.
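The core trick in that line of work, roughly: relax each two-input gate to a learned softmax over all 16 boolean functions so the circuit becomes differentiable, then harden every gate to its argmax op after training. A minimal PyTorch sketch of that idea (illustrative only; the paper's actual architecture adds convolutional wiring, grouping, and pooling):

```python
# Minimal sketch of a differentiable logic gate layer in the spirit of
# arXiv:2411.04732. Illustrative, not the paper's implementation.
import torch

def all_ops(a, b):
    """Real-valued relaxations of the 16 two-input boolean functions."""
    return torch.stack([
        torch.zeros_like(a),        # FALSE
        a * b,                      # AND
        a - a * b,                  # a AND NOT b
        a,                          # a
        b - a * b,                  # b AND NOT a
        b,                          # b
        a + b - 2 * a * b,          # XOR
        a + b - a * b,              # OR
        1 - (a + b - a * b),        # NOR
        1 - (a + b - 2 * a * b),    # XNOR
        1 - b,                      # NOT b
        1 - b + a * b,              # a OR NOT b
        1 - a,                      # NOT a
        1 - a + a * b,              # b OR NOT a
        1 - a * b,                  # NAND
        torch.ones_like(a),         # TRUE
    ])

class LogicGateLayer(torch.nn.Module):
    def __init__(self, n_gates, n_inputs):
        super().__init__()
        # Fixed random wiring: each gate reads two of the layer's inputs.
        self.register_buffer("idx_a", torch.randint(n_inputs, (n_gates,)))
        self.register_buffer("idx_b", torch.randint(n_inputs, (n_gates,)))
        # Learnable distribution over the 16 ops, one per gate.
        self.logits = torch.nn.Parameter(torch.randn(n_gates, 16))

    def forward(self, x):                       # x: (batch, n_inputs) in [0, 1]
        a, b = x[:, self.idx_a], x[:, self.idx_b]
        ops = all_ops(a, b)                     # (16, batch, n_gates)
        w = torch.softmax(self.logits, dim=-1)  # (n_gates, 16)
        return torch.einsum("obg,go->bg", ops, w)
```

After training, snapping each gate to its argmax operation leaves a pure boolean netlist, i.e., exactly the kind of thing you could place and route on an FPGA.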
4. ryao No.42191030
There is not enough memory attached to FPGAs to do this. Some FPGAs come with 16 GB of HBM, but that is not enough capacity, and the bandwidth is lower than what GPUs provide. You would need to work out how to connect enough memory chips simultaneously to get both high bandwidth and high capacity before an FPGA solution could be performance-competitive with a GPU solution.
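The capacity gap is stark even before bandwidth enters the picture; rough numbers (my estimates, not from the comment):

```python
# Why 16 GB of FPGA-attached HBM falls short for Llama 3.1 405B.
# All figures are ballpark assumptions for illustration.
weights_gb = 405e9 * 2 / 1e9   # ~810 GB at fp16 (~405 GB even at int8)
fpga_hbm_gb = 16               # typical HBM-equipped FPGA card

print(f"need ~{weights_gb:.0f} GB, have {fpga_hbm_gb} GB: "
      f"~{weights_gb / fpga_hbm_gb:.0f}x short per device")
# And per-device HBM bandwidth on such cards (~0.4-0.8 TB/s) trails a
# single H100 (~3.35 TB/s), so sharding across many FPGAs starts behind.
```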
5. mikewarot No.42191545
Instead of separate memory/compute, I propose to fuse them.