←back to thread

426 points benchmarkist | 1 comments | | HN request time: 0s | source
Show context
zackangelo ◴[] No.42179476[source]
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #
modeless ◴[] No.42179493[source]
Cerebras is a chip company. They are not using GPUs. Their chip uses wafer scale integration which means it's the physical size of a whole wafer, dozens of GPUs in one.

They have limited memory on chip (all SRAM) and it's not clear how much HBM bandwidth they have per wafer. It's a completely different optimization problem than running on GPU clusters.

replies(2): >>42180735 #>>42190988 #
1. ryao ◴[] No.42190988[source]
They do not use HBM. Offchip memory is accessible at 150GB/sec.