426 points by benchmarkist | 2 comments
zackangelo No.42179476
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
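
For reference, here's a toy sketch of what speculative decoding buys you: a cheap draft model proposes a few tokens, and the big model verifies them all in one batched forward pass, so each expensive pass can emit several tokens. The draft_model/target_model below are hypothetical random-logit stand-ins, not any real implementation:

    import numpy as np

    # Toy sketch of greedy speculative decoding. Both "models" are
    # hypothetical stand-ins returning random logits over a tiny vocabulary.
    VOCAB = 16
    rng = np.random.default_rng(0)

    def draft_model(tokens):
        # Fast, lower-quality model: logits for the next token only.
        return rng.standard_normal(VOCAB)

    def target_model(tokens):
        # Slow, high-quality model: next-token logits at every position,
        # computed in a single batched pass over the whole sequence.
        return rng.standard_normal((len(tokens), VOCAB))

    def speculative_step(tokens, k=4):
        # tokens: non-empty list of token ids.
        # 1) Draft k candidate tokens autoregressively with the cheap model.
        draft = list(tokens)
        for _ in range(k):
            draft.append(int(np.argmax(draft_model(draft))))
        proposed = draft[len(tokens):]

        # 2) One expensive target pass scores every drafted position at once.
        logits = target_model(draft)

        # 3) Keep proposals while the target's greedy choice agrees; on the
        #    first disagreement, substitute the target's own token and stop.
        #    Either way, each expensive pass yields at least one new token.
        out = list(tokens)
        for i, tok in enumerate(proposed):
            want = int(np.argmax(logits[len(tokens) + i - 1]))
            out.append(want)
            if want != tok:
                break
        return out

    print(speculative_step([1, 2, 3]))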

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #
1. mmaunder No.42179794
Nah. Try vLLM and 405B FP8 on that hardware. And make sure you’re benchmarking with some concurrency for max TPS.
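
For context, a minimal vLLM setup along those lines might look like the sketch below; the FP8 checkpoint name and the batch settings are assumptions, not a tested config:

    from vllm import LLM, SamplingParams

    # Sketch: Llama 3.1 405B in FP8, sharded across the 8x H100s, with room
    # for concurrent sequences so you measure throughput at a realistic batch.
    llm = LLM(
        model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # assumed checkpoint name
        tensor_parallel_size=8,  # one shard per H100
        max_num_seqs=64,         # single-stream TPS badly understates the hardware
    )
    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(["Explain KV caching."] * 64, params)  # 64 concurrent requests
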
replies(1): >>42189943 #
2. zackangelo No.42189943
Related recent discussion on twitter: https://x.com/Teknium1/status/1858987850739728635

Looks like other folks get 80 tok/s with max batch size. That's surprising to me, but vLLM is definitely more optimized than my implementation.
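
For anyone trying to reproduce a number like that, a rough benchmark is to submit a full batch at once and divide generated tokens by wall-clock time. Sketch only; the model name and batch size of 64 here are assumptions, not the setup from the linked thread:

    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=8)
    params = SamplingParams(temperature=0.0, max_tokens=128)
    prompts = ["Summarize the plot of Hamlet."] * 64  # assumed "max batch size"

    start = time.time()
    outputs = llm.generate(prompts, params)
    elapsed = time.time() - start

    # Aggregate throughput across the batch, plus a rough per-request figure
    # (all requests run concurrently, so per-request TPS is tokens/request/time).
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{generated / elapsed:.1f} aggregate tok/s, "
          f"{generated / len(outputs) / elapsed:.1f} tok/s per request")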