←back to thread

426 points benchmarkist | 1 comments | | HN request time: 0.207s | source
Show context
zackangelo ◴[] No.42179476[source]
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #
danpalmer ◴[] No.42179501[source]
Cerebras makes CPUs with ~1 million cores, and they're inferring on that not on GPUs. It's an entirely different architecture which means no network involved. It's possible they're doing this significantly from CPU caches rather than HBM as well.

I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.

replies(3): >>42179509 #>>42179599 #>>42179717 #
accrual ◴[] No.42179717[source]
I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).
replies(4): >>42179769 #>>42179834 #>>42180050 #>>42180265 #
1. visarga ◴[] No.42180265[source]
You still have to pay for the memory. The Cerebras chip is fast because they use 700x more SRAM than, say, A100 GPUs. Loading the whole model in SRAM every time you compute one token is the expensive bit.