
426 points benchmarkist | 9 comments
zackangelo No.42179476
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference, and maybe some kind of sparse attention mechanism?
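A rough roofline sketch of why ~100 tok/s is already near the ceiling for batch-1 decoding: every generated token has to stream all the weights out of HBM once. The numbers below are the editor's assumptions, not from the thread (fp16 weights, ideal tensor parallelism, ~3.35 TB/s HBM3 bandwidth per H100 SXM):

```python
# Memory-bandwidth roofline for batch-1 autoregressive decoding.
# Assumptions (not from the thread): fp16 weights, ideal tensor
# parallelism across all GPUs, H100 SXM HBM3 at ~3.35 TB/s each.

PARAMS = 70e9           # Llama 3.1 70B parameter count
BYTES_PER_PARAM = 2     # fp16
N_GPUS = 8
HBM_BW = 3.35e12        # bytes/s per H100 (SXM, HBM3)

model_bytes = PARAMS * BYTES_PER_PARAM   # ~140 GB of weights
aggregate_bw = N_GPUS * HBM_BW           # ideal aggregate HBM bandwidth

# Each decoded token reads every weight once from HBM, so:
ceiling_tok_s = aggregate_bw / model_bytes
print(f"{ceiling_tok_s:.0f} tok/s ceiling")   # ~191 tok/s
```

Under these assumptions the hard ceiling is around 190 tok/s, which is why the standard tricks alone can't produce an order-of-magnitude jump without changing the memory technology or sparsifying the reads.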

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #
1. simonw No.42180035
They have a chip the size of a dinner plate. Take a look at the pictures: https://cerebras.ai/product-chip/
replies(3): >>42180116 #>>42180150 #>>42180490 #
2. pram No.42180116
I'd love to see the heatsink for this lol
replies(1): >>42180317 #
3. Aeolun No.42180150
21 petabytes per second. Can push the whole internet over that chip xD
replies(2): >>42180630 #>>42180744 #
4. futureshock No.42180317
They call it the “engine block”!

https://www.servethehome.com/a-cerebras-cs-2-engine-block-ba...

5. ekianjo No.42180490
What kind of yield do they get at that die size?
replies(2): >>42180600 #>>42180601 #
6. bufferoverflow No.42180600
It's near 100%. Discussed here:

https://youtu.be/f4Dly8I8lMY?t=95

7. petra No.42180601
Part of their technology is managing/bypassing defects.
8. why_only_15 No.42180630
The number for that is, I believe, 1 terabit/s, or 125 GB/s. The 21 petabytes/s is the speed from the SRAM (~registers) to the cores (~ALUs) across the whole chip. That's not especially impressive as SRAM speeds go; the impressive thing is that they have an enormous amount of SRAM.
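The unit arithmetic behind the comparison, using the two figures the commenters cite (the ~1 Tbit/s internet figure is the commenter's claim, not verified here):

```python
# Comparing the commenter's two bandwidth figures.
internet_bps = 1e12             # claimed total internet traffic: ~1 terabit/s
internet_Bps = internet_bps / 8 # bits -> bytes
sram_Bps = 21e15                # Cerebras on-chip SRAM figure: 21 PB/s

print(internet_Bps / 1e9)       # 125.0  (GB/s, matching the comment)
print(sram_Bps / internet_Bps)  # 168000.0  (SRAM figure vs. internet figure)
```

So the 21 PB/s on-chip number is about five orders of magnitude larger than the quoted internet-traffic figure, but as the comment notes, the two measure very different things.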
9. KeplerBoy No.42180744
That's their on-chip cache bandwidth. Usually that stuff isn't even quoted as bandwidth but as latency.