
426 points | benchmarkist | 1 comment
zackangelo | No.42179476
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
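For anyone unfamiliar, the speculative decoding mentioned above works roughly like this. A greedy, single-sequence sketch only, not Cerebras' (or anyone's) production implementation; draft_model and target_model are hypothetical callables that take a 1-D tensor of token ids and return per-position next-token logits:

    import torch

    def speculative_step(target_model, draft_model, tokens, k=4):
        """One round: draft k tokens with the small model, verify with the big one."""
        n = tokens.shape[0]
        draft = tokens
        # 1) The cheap draft model proposes k tokens autoregressively.
        for _ in range(k):
            logits = draft_model(draft)                   # [len, vocab]
            draft = torch.cat([draft, logits[-1].argmax().view(1)])
        # 2) The big model scores prompt + all k drafted tokens in ONE forward pass.
        target_pred = target_model(draft).argmax(dim=-1)  # big model's pick after each position
        accepted = tokens
        for i in range(k):
            if target_pred[n - 1 + i] == draft[n + i]:    # big model agrees with the draft
                accepted = torch.cat([accepted, draft[n + i].view(1)])
            else:                                         # first disagreement: take the big
                accepted = torch.cat([accepted, target_pred[n - 1 + i].view(1)])
                break                                     # model's token and stop this round
        else:
            # all k drafts accepted; the same pass also yields one "bonus" token
            accepted = torch.cat([accepted, target_pred[-1].view(1)])
        return accepted

The point of the trick is that the expensive model emits several tokens per forward pass whenever the draft agrees with it, but as noted above that alone doesn't get you into the throughput range being discussed.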

yalok | No.42179489
How much memory do you need to run Llama 3 70B at fp8? Could it potentially fit on a single H100 GPU with 96GB of memory?

In other words, if you ran 8 separate 70B models on your cluster, one per GPU, how much higher could your overall token throughput be compared to parallelizing a single model across all 8 GPUs and taking the slowdown from NVLink communication?

zackangelo | No.42179533
It’s been a minute, so my memory might be off, but I think when I ran 70B at fp16 it just barely fit on a 2x A100 80GB cluster, then quickly OOMed as the context/KV cache grew.

So if I had to guess, a 96GB H100 could probably run it at fp8 as long as you didn’t need a big context window. If you’re doing speculative decoding it probably won’t fit, because you also need weights and KV cache for the draft model.
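For concreteness, here is the back-of-the-envelope arithmetic behind that guess. The shapes are assumptions taken from the published Llama 3 70B config (80 layers, GQA with 8 KV heads of head dim 128), not from anyone's actual deployment:

    def weights_gb(n_params=70e9, bytes_per_param=1):
        # fp8 ~= 1 byte/param, fp16 ~= 2 bytes/param (ignoring embeddings and outliers)
        return n_params * bytes_per_param / 1e9

    def kv_cache_gb(n_tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_val=2):
        # K and V per layer per token, fp16 cache by default
        return n_tokens * n_layers * n_kv_heads * head_dim * 2 * bytes_per_val / 1e9

    print(weights_gb(bytes_per_param=2))  # ~140 GB: why fp16 "just barely" fits on 2x 80GB A100s
    print(weights_gb(bytes_per_param=1))  # ~70 GB: fp8 weights alone fit in 96 GB
    print(kv_cache_gb(32_000))            # ~10.5 GB of fp16 KV cache at a 32k context

Under those assumptions you have roughly 26 GB of headroom after the fp8 weights, and a long context plus a draft model's weights and its own KV cache eats through that quickly, which is consistent with the comment above.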