Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

(cerebras.ai)

zackangelo ◴[19 Nov 24 02:11 UTC] No.42179476[source]▶

>>42178761 (OP) #

This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #

danpalmer ◴[19 Nov 24 02:17 UTC] No.42179501[source]▶

>>42179476 #

Cerebras makes CPUs with ~1 million cores, and they're running inference on those rather than on GPUs. It's an entirely different architecture, which means no network is involved. It's possible they're serving this largely from CPU caches rather than HBM as well.

I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.

replies(3): >>42179509 #>>42179599 #>>42179717 #

accrual ◴[19 Nov 24 03:02 UTC] No.42179717[source]▶

>>42179501 #

I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).

replies(4): >>42179769 #>>42179834 #>>42180050 #>>42180265 #

danpalmer ◴[19 Nov 24 04:13 UTC] No.42180050[source]▶

>>42179717 #

I believe pricing was mid 6 figures per machine. They're also like 8U and water cooled, I believe. I doubt it would be possible to deploy one outside of a fairly top-tier colo facility that can support water cooling. Also, imagine learning a new CUDA, but one designed for a completely different compute model.

replies(5): >>42180442 #>>42180470 #>>42180527 #>>42181229 #>>42181357 #

1. trsohmers ◴[19 Nov 24 06:21 UTC] No.42180527[source]▶

>>42180050 #

Based on their S1 filing and public statements, the average cost per WSE system for their largest customer (~90% of their total revenue) is ~$1.36M, and I’ve heard “retail” pricing of $2.5M per system. They are also 15U and, due to power and additional support equipment, take up an entire rack.

The other thing people don’t seem to be getting in this thread is that just holding the weights for 405B at FP16 requires 19 of their systems, since it is SRAM only. Rounding up to 20 to account for program code + KV cache for the user context would mean 20 systems/racks, so well over $20M. The full rack (including support equipment) also consumes 23 kW, so we are talking nearly half a megawatt and ~$30M for them to be getting this performance on Llama 405B.
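
The arithmetic behind those figures can be sketched roughly as follows. This is only a back-of-the-envelope check, not anything from Cerebras: the 44 GB of on-chip SRAM per system is an assumption (it is not stated in this thread), the function names are made up for illustration, and the per-system prices and 23 kW per rack are simply the numbers quoted above.

```python
import math


def systems_for_weights(params: float, bytes_per_param: float,
                        sram_gb_per_system: float) -> int:
    """Systems needed just to hold the weights in on-chip SRAM."""
    weight_gb = params * bytes_per_param / 1e9
    return math.ceil(weight_gb / sram_gb_per_system)


def rack_power_kw(systems: int, kw_per_rack: float) -> float:
    """Total power draw, assuming one system per rack."""
    return systems * kw_per_rack


def fleet_cost_musd(systems: int, musd_per_system: float) -> float:
    """Total hardware cost in millions of USD."""
    return systems * musd_per_system


if __name__ == "__main__":
    SRAM_GB = 44        # ASSUMPTION: SRAM per system, not given in the thread
    PARAMS = 405e9      # Llama 3.1 405B
    FP16_BYTES = 2

    weights_only = systems_for_weights(PARAMS, FP16_BYTES, SRAM_GB)
    total = 20          # rounded up in the comment for code + KV cache
    print("systems for weights alone:", weights_only)
    print("power for", total, "racks (kW):", rack_power_kw(total, 23))
    print("cost at $1.36M each ($M):", fleet_cost_musd(total, 1.36))
    print("cost at $2.5M each ($M):", fleet_cost_musd(total, 2.5))
```

Under that SRAM assumption the weights alone need 19 systems, and 20 racks at 23 kW each comes to 460 kW, which lines up with "nearly half a megawatt".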

replies(5): >>42180544 #>>42181290 #>>42181897 #>>42181931 #>>42190965 #

2. danpalmer ◴[19 Nov 24 06:25 UTC] No.42180544[source]▶

>>42180527 (TP) #

Thank you, far better answer than mine! Those are indeed wild numbers, although interestingly "only" 23 kW. I'd expect the same level of compute in GPUs to draw quite a lot more than that, or at least to have higher power density.

replies(1): >>42180615 #

3. YetAnotherNick ◴[19 Nov 24 06:42 UTC] No.42180615[source]▶

>>42180544 #

You get ~400 TFLOP/s from an H100 for 350 W. You need (2 * tokens/s * param count) FLOP/s. For 405B at 969 tok/s you need only ~785 TFLOP/s, which is just 2 H100s.

The limiting factor with GPUs for inference is memory bandwidth. For 969 tok/s in int8, you need 392 TB/s memory bandwidth or about 200 H100s.
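
For anyone who wants to redo this estimate with other numbers, here is a minimal sketch of the two calculations. It assumes batch size 1 (every token reads all the weights once) and ignores KV cache traffic. The helper names are hypothetical, and the per-GPU bandwidth values are assumptions: roughly 2 TB/s reproduces the "about 200 cards" figure, while 3.35 TB/s is what I'd assume for an SXM part.

```python
import math


def required_tflops(params: float, tokens_per_s: float) -> float:
    """Compute needed for decoding: ~2 FLOPs per parameter per token."""
    return 2 * params * tokens_per_s / 1e12


def required_bandwidth_tbs(params: float, bytes_per_param: float,
                           tokens_per_s: float) -> float:
    """Weight traffic at batch size 1: all weights are read once per token."""
    return params * bytes_per_param * tokens_per_s / 1e12


def gpus_needed(requirement: float, per_gpu: float) -> int:
    """Round up to whole GPUs for a given per-GPU capability."""
    return math.ceil(requirement / per_gpu)


if __name__ == "__main__":
    PARAMS = 405e9
    TOK_S = 969

    flops = required_tflops(PARAMS, TOK_S)
    bw = required_bandwidth_tbs(PARAMS, 1, TOK_S)  # int8 = 1 byte per param

    print(f"compute: {flops:.1f} TFLOP/s ->",
          gpus_needed(flops, 400), "GPUs at 400 TFLOP/s")
    # ASSUMPTION: per-GPU memory bandwidth in TB/s, not stated in the thread.
    for per_gpu_bw in (2.0, 3.35):
        print(f"bandwidth: {bw:.1f} TB/s ->",
              gpus_needed(bw, per_gpu_bw), f"GPUs at {per_gpu_bw} TB/s")
```

The gap between the compute-bound count and the bandwidth-bound count is the whole point: at batch size 1 the cards would sit mostly idle waiting on memory.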

replies(3): >>42182416 #>>42182750 #>>42190957 #

4. meowface ◴[19 Nov 24 08:52 UTC] No.42181290[source]▶

>>42180527 (TP) #

Thank you for the breakdown. Bit of an emotional journey.

"$500 in the future...? Oh, $30 million now, so that might be a while..."

replies(1): >>42181646 #

5. jamalaramala ◴[19 Nov 24 09:46 UTC] No.42181646[source]▶

>>42181290 #

It took 30 years for computers to go from entire rooms to desktops, and another 30 years to go from desktops to our pockets.

I don't know if we can extrapolate, but I can imagine AI inference on our desktops for $500 in a few years...

replies(2): >>42182572 #>>42197460 #

6. sumedh ◴[19 Nov 24 10:24 UTC] No.42181897[source]▶

>>42180527 (TP) #

> Based on their S1 filing and public statements

Is it a good stock to buy :)

7. petra ◴[19 Nov 24 10:29 UTC] No.42181931[source]▶

>>42180527 (TP) #

Given those details, they don't seem much better on cost per token than Nvidia-based systems.

8. latchkey ◴[19 Nov 24 11:49 UTC] No.42182416{3}[source]▶

>>42180615 #

Memory bandwidth and memory size. Along with power/cooling density.

That's why you see AMD's MI325X coming out with 256GB of HBM3e but the same FLOPs as the MI300X. It also has 6TB/s of bandwidth, which outperforms the H200's by a lot.

You can see the direction AMD is going with this...

https://www.amd.com/en/products/accelerators/instinct/mi300/...

9. stefs ◴[19 Nov 24 12:10 UTC] No.42182572{3}[source]▶

>>42181646 #

Well, we can run AI inference on our desktops for $500 today, just with smaller models and far slower.

replies(1): >>42197536 #

10. Const-me ◴[19 Nov 24 12:35 UTC] No.42182750{3}[source]▶

>>42180615 #

> For 969 tok/s in int8, you need 392 TB/s memory bandwidth

I think that math is only valid for batch size = 1. When these 969 tokens/second come from multiple sessions of the same batch, loaded model tensor elements are reused to compute many tokens for the entire batch. With large enough batches, you can even saturate compute throughput of the GPU instead of bottlenecking on memory bandwidth.
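
A toy roofline model makes the batching effect concrete. This is a simplification under stated assumptions, not a description of any real serving stack: each decode step reads the weights once regardless of batch size, compute grows linearly with batch size, and KV cache reads and inter-GPU communication are ignored. The function names are hypothetical; the 400 TFLOP/s figure comes from the parent comment and the 2 TB/s bandwidth is an assumption.

```python
def decode_tokens_per_s(params: float, bytes_per_param: float, batch: int,
                        bandwidth_bytes_s: float, flops_s: float) -> float:
    """Aggregate tokens/s for one device under a simple roofline model."""
    weight_read_s = params * bytes_per_param / bandwidth_bytes_s  # once per step
    compute_s = 2 * params * batch / flops_s                      # scales with batch
    step_s = max(weight_read_s, compute_s)   # the slower resource sets step time
    return batch / step_s                    # one token per sequence per step


def break_even_batch(bytes_per_param: float, bandwidth_bytes_s: float,
                     flops_s: float) -> float:
    """Batch size at which compute time equals weight-read time."""
    return bytes_per_param * flops_s / (2 * bandwidth_bytes_s)


if __name__ == "__main__":
    PARAMS = 405e9
    BW = 2e12       # ASSUMPTION: 2 TB/s memory bandwidth
    FLOPS = 400e12  # ~400 TFLOP/s, as quoted in the parent comment

    for batch in (1, 8, 32, 100, 200):
        rate = decode_tokens_per_s(PARAMS, 1, batch, BW, FLOPS)
        print(f"batch={batch:>3}: {rate:8.1f} tok/s aggregate")
    print("break-even batch:", break_even_batch(1, BW, FLOPS))
```

In this model, aggregate throughput grows linearly with batch size until the break-even point and then flattens at the compute limit, while per-user speed never exceeds the batch size 1 figure.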

replies(1): >>42190945 #

11. ryao ◴[20 Nov 24 04:58 UTC] No.42190945{4}[source]▶

>>42182750 #

They claim to obtain that number with 8 to 20 concurrent users:

https://x.com/draecomino/status/1858998347090325846

12. ryao ◴[20 Nov 24 05:01 UTC] No.42190957{3}[source]▶

>>42180615 #

Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.

replies(1): >>42193142 #

13. ryao ◴[20 Nov 24 05:03 UTC] No.42190965[source]▶

>>42180527 (TP) #

From what I have read, it is a maximum of 23 kW per chip and each chip goes into a 16U system. That said, you would need at least 460 kW of power to run the setup you described.

As for retail pricing being $2.5 million, I read $2 million in a news article earlier this year. $2.5 million makes it sound even worse.

14. YetAnotherNick ◴[20 Nov 24 12:07 UTC] No.42193142{4}[source]▶

>>42190957 #

8 H100s could achieve a lot more than 1000 tokens/sec.

> Memory bandwidth for inferencing does not scale with the number of GPU

It does

replies(1): >>42197442 #

15. ryao ◴[20 Nov 24 19:50 UTC] No.42197442{5}[source]▶

>>42193142 #

This is on llama 3.1 405B.

Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.

replies(1): >>42200967 #

16. ◴[20 Nov 24 19:51 UTC] No.42197460{3}[source]▶

>>42181646 #

17. ryao ◴[20 Nov 24 20:00 UTC] No.42197536{4}[source]▶

>>42182572 #

There is no need to use smaller models. You can run the biggest models such as llama 3.1 405B on a fairly low end desktop today:

https://github.com/lyogavin/airllm

However, it will be far slower as you said.

18. YetAnotherNick ◴[21 Nov 24 04:01 UTC] No.42200967{6}[source]▶

>>42197442 #

These two sources[1][2] show 1500-2500 tokens per second on 8x H100.

[1]: https://lmsys.org/blog/2024-07-25-sglang-llama3/?ref=blog.ru...

[2]: https://www.snowflake.com/engineering-blog/optimize-llms-wit...
