426 points by benchmarkist | 6 comments
zackangelo (No.42179476)
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like, at a minimum, you’d need multi-node inference and maybe some kind of sparse attention mechanism?

parsimo2010 (No.42179503)
They are doing it with custom silicon that has several times the total die area of 8x H100s. I’m sure there is some optimization at the execution/runtime level too, but the primary difference is sheer transistor count.

https://cerebras.ai/product-chip/

coder543 (No.42179580)
To be specific, a single WSE-3 has the same die area as about 57 H100s. It's a big chip.
cma (No.42179826)
It’s worth splitting out the stacked memory dies on both as well (assuming Cerebras is set up with external DRAM). HBM stacks are over 10 layers now, so the total silicon area is a good bit more than the logic die area, though different process nodes are involved.
tomrod (No.42180319)
Amazing!