
426 points | benchmarkist | 1 comment
zackangelo (No.42179476)
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, FlashAttention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
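As a rough sanity check on what’s even possible, here is a back-of-envelope sketch (not anyone’s actual implementation) assuming batch-1 decode is memory-bandwidth bound, i.e. every generated token has to stream all of the fp16 weights from HBM once; the per-GPU bandwidth is the H100 SXM spec-sheet figure and everything else is approximate:

    # Back-of-envelope decode ceiling for Llama 3.1 70B on 8x H100, tensor parallel.
    # Assumes batch-1 decode is memory-bandwidth bound: each token streams all weights once.
    PARAMS = 70e9               # Llama 3.1 70B parameter count
    BYTES_PER_PARAM = 2         # fp16/bf16 weights
    HBM_BW_PER_GPU = 3.35e12    # ~3.35 TB/s per H100 SXM (spec-sheet figure)
    N_GPUS = 8

    weight_bytes = PARAMS * BYTES_PER_PARAM   # ~140 GB of weights
    aggregate_bw = HBM_BW_PER_GPU * N_GPUS    # ~26.8 TB/s combined

    # Ignores KV-cache reads, NVLink all-reduces, and kernel launch overhead.
    print(f"ideal ceiling: ~{aggregate_bw / weight_bytes:.0f} tok/s")  # ~190 tok/s

By that math, ~100 tok/s is already a sizable fraction of the fp16 ceiling, which is why numbers far beyond it suggest aggressive quantization, speculative decoding, or different hardware rather than just better kernels.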

yalok (No.42179489)
How much memory do you need to run fp8 Llama 3 70B? Can it potentially fit on a single H100 GPU with 96GB of RAM?

In other words, if you wanted to run 8 separate 70B models on your cluster, each fitting on a single GPU, how much higher would your overall token throughput be compared to parallelizing one model across all 8 GPUs and taking some slowdown from NVLink?
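For a ballpark answer to the memory half of this, a minimal sketch of the arithmetic (assuming fp8 for both the weights and the KV cache, and the published Llama 3 70B config of 80 layers and 8 KV heads of dimension 128; runtime overhead varies by framework):

    # Rough memory estimate for Llama 3 70B served in fp8 on a single GPU.
    GB = 1024**3

    weights_fp8 = 70e9 * 1 / GB                       # ~65 GiB of fp8 weights

    # KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * 1 byte (fp8).
    layers, kv_heads, head_dim = 80, 8, 128
    kv_per_token = 2 * layers * kv_heads * head_dim   # ~160 KiB per token

    kv_cache = 32_000 * kv_per_token / GB             # one 32k-token sequence: ~5 GiB

    print(f"weights ~{weights_fp8:.0f} GiB, KV cache ~{kv_cache:.1f} GiB")
    # ~65 GiB + ~5 GiB + activations and runtime overhead: tight but plausible on a
    # ~94 GB H100 NVL, not on an 80 GB H100 with much context to spare.

On the throughput half: if decode is memory-bandwidth bound, eight independent replicas and one 8-way tensor-parallel model stream roughly the same total bytes per token, so aggregate throughput should land in the same ballpark; the replicas avoid interconnect overhead, while the single parallelized model gives better latency per request.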

qingcharles (No.42180112)
It should work, I believe. And anything that doesn't fit, you can leave in system RAM.

Looks like a single H100 runs about $30K online. Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?
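On the software side, a minimal sketch of what that would look like with the llama-cpp-python bindings (the GGUF filename is a placeholder; n_gpu_layers is the knob that decides how many layers live on the H100 versus in system RAM, as suggested above):

    # pip install llama-cpp-python  (built with CUDA support)
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-3.1-70b-instruct-q4_k_m.gguf",  # placeholder quantized model
        n_gpu_layers=-1,  # -1 offloads all layers; lower it to keep some in system RAM
        n_ctx=8192,       # context length; the KV cache grows with this
    )

    out = llm("Explain speculative decoding in one paragraph.", max_tokens=256)
    print(out["choices"][0]["text"])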

joha4270 (No.42180911)
> Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?

Cooling might be a challenge. The H100 has a passive heatsink designed to rely on the case fans, so you need fairly high airflow through a part that can't move any air itself.

On a server this isn't too big a problem: you have fans at one end and GPUs blocking the exit at the other. In a desktop, though, you'll probably need to get creative with cardboard or 3D-printed shrouds to force enough air through it.