426 points by benchmarkist | 6 comments
zackangelo (No.42179476)
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like, at a minimum, you’d need multi-node inference and maybe some kind of sparse attention mechanism?

parsimo2010 (No.42179503)
They are doing it with custom silicon that has several times the total die area of 8x H100s. I’m sure there is some optimization at the execution/runtime level too, but the primary difference is sheer transistor count.

https://cerebras.ai/product-chip/

coder543 (No.42179580)
To be specific, a single WSE-3 has the same die area as about 57 H100s. It's a big chip.
cma (No.42179826)
It’s worth splitting out the stacked memory dies on both as well (assuming Cerebras is set up with external DRAM). HBM stacks are over 10 layers now, so the total silicon area is a good bit more than the logic die area, though different process nodes are involved.
tomrod (No.42180319)
Amazing!