(cerebras.ai)

427 points benchmarkist | 1 comments | 19 Nov 24 00:15 UTC | HN request time: 0.239s | source

Show context

zackangelo ◴[19 Nov 24 02:11 UTC] No.42179476[source]▶

This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #

danpalmer ◴[19 Nov 24 02:17 UTC] No.42179501[source]▶

>>42179476 #

Cerebras makes CPUs with ~1 million cores, and they're inferring on that not on GPUs. It's an entirely different architecture which means no network involved. It's possible they're doing this significantly from CPU caches rather than HBM as well.

I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.

replies(3): >>42179509 #>>42179599 #>>42179717 #

zackangelo ◴[19 Nov 24 02:18 UTC] No.42179509[source]▶

>>42179501 #

Ah, makes a lot more sense now.

replies(1): >>42179747 #

1. StrangeDoctor ◴[19 Nov 24 03:07 UTC] No.42179747[source]▶

>>42179509 #

also the WSE3 pulls 15kw. https://www.eetimes.com/cerebras-third-gen-wafer-scale-chip-...

but 8x h100 are ~2.6-5.2kw (I get conflicting info, I think based on pice vs smx) so anywhere between roughly even and up to 2x efficient.

↑

Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference