Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

(cerebras.ai)

427 points benchmarkist | 2 comments | 19 Nov 24 00:15 UTC


zackangelo ◴[19 Nov 24 02:11 UTC] No.42179476[source]▶

This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70B implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe use some kind of sparse attention mechanism?

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #

danpalmer ◴[19 Nov 24 02:17 UTC] No.42179501[source]▶

>>42179476 #

Cerebras makes CPUs with ~1 million cores, and they're running inference on those, not on GPUs. It's an entirely different architecture, which means no network is involved. It's possible they're also serving this largely from CPU caches rather than HBM.

I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.

replies(3): >>42179509 #>>42179599 #>>42179717 #

accrual ◴[19 Nov 24 03:02 UTC] No.42179717[source]▶

>>42179501 #

I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).

replies(4): >>42179769 #>>42179834 #>>42180050 #>>42180265 #

danpalmer ◴[19 Nov 24 04:13 UTC] No.42180050[source]▶

>>42179717 #

I believe pricing was in the mid six figures per machine. I believe they're also something like 8U and water-cooled. I doubt it would be possible to deploy one outside of a fairly top-tier colo facility that can support water cooling. Also, imagine learning a new CUDA, but one designed for a completely different compute model.

replies(5): >>42180442 #>>42180470 #>>42180527 #>>42181229 #>>42181357 #

bboygravity ◴[19 Nov 24 06:10 UTC] No.42180470[source]▶

>>42180050 #

That means it'll be close to affordable in 3 to 5 years if we follow the curve we've been on for the past decades.

replies(3): >>42180845 #>>42180967 #>>42181580 #

1. dheera ◴[19 Nov 24 07:59 UTC] No.42180967[source]▶

>>42180470 #

It will also mean 405B models will be uninteresting in 3 to 5 years if we follow the curve we've been on for the past decades.

replies(1): >>42181405 #

2. int_19h ◴[19 Nov 24 09:13 UTC] No.42181405[source]▶

>>42180967 (TP) #

I don't think they'll be uninteresting. They won't be cutting-edge anymore, sure, but many of the more practical applications of AI that we see today don't run on today's cutting-edge models, either. We're always going to have a certain compute budget, and if a smaller model does the job fine, why wouldn't you use it, and use the rest for something else (or use all of it to run the smaller model faster)?
