Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

(cerebras.ai)

427 points benchmarkist | 2 comments | 19 Nov 24 00:15 UTC


danpalmer ◴[19 Nov 24 02:23 UTC] No.42179527[source]▶

I'm not sure if they're comparing apples to apples on the latency here. There are roughly three parts to the latency: the time taken to process the context/prompt, the time spent queueing for hardware access, and the other standard API overheads (network, etc.).

From what I understand, several, maybe all, of the comparison services are not based on provisioned capacity, which means that the measurements include the queue time. For LLMs this can be significant. The Cerebras number on the other hand almost certainly doesn't have some unbounded amount of queue time included, as I expect they had guaranteed hardware access.
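To make the breakdown concrete, here is a minimal sketch of how queue time changes the throughput a client would observe when timing the whole request. Every number is hypothetical rather than a measurement, and the function and parameter names are only illustrative.

```python
def observed_tokens_per_second(output_tokens, decode_tps, prefill_s, queue_s, overhead_s):
    """Tokens/s as seen by a client that times the whole request."""
    decode_s = output_tokens / decode_tps               # time spent generating tokens
    total_s = queue_s + prefill_s + decode_s + overhead_s
    return output_tokens / total_s


# Hypothetical request: 500 output tokens decoded at 1,000 tokens/s.
provisioned = observed_tokens_per_second(
    500, 1000.0, prefill_s=0.2, queue_s=0.0, overhead_s=0.05
)
shared = observed_tokens_per_second(
    500, 1000.0, prefill_s=0.2, queue_s=2.0, overhead_s=0.05
)

print(f"no queueing:  {provisioned:.0f} tokens/s")
print(f"2 s in queue: {shared:.0f} tokens/s")
```

With these made-up inputs, the same hardware looks several times slower once the request waits in a queue, which is why the comparison only holds if both sides are measured the same way.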

The throughput here is amazing, but to get that throughput at a good latency for end-users means over-provisioning, and it's unclear what queueing will do to this. Additionally, does that latency depend on the machine being ready with the model, or does that include loading the model if necessary? If using a fine-tuned model does this change the latency?

I'm sure it's a clear win for batch workloads where you can keep Cerebras machines running at 100% utilisation and get 1k tokens/s constantly.

replies(1): >>42179788 #

qeternity ◴[19 Nov 24 03:14 UTC] No.42179788[source]▶

>>42179527 #

Everyone presumes this is under ideal conditions...and it's incredible.

It's bs=1. At 1,000 t/s. Of a 405B parameter model. Wild.

replies(3): >>42179907 #>>42180069 #>>42191042 #

danpalmer ◴[19 Nov 24 04:18 UTC] No.42180069[source]▶

>>42179788 #

Cerebras' benchmark is most likely under ideal conditions, but I'm not sure it's possible to test public cloud APIs under ideal conditions: it's shared infrastructure, so you just don't know whether a given request is "ideal". I think you can only test these things across significant numbers of requests, and that still assumes that shared resource usage doesn't change much.
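As a rough sketch of what testing across many requests could look like, the snippet below summarizes a set of latency samples by percentile instead of relying on a single request. The samples are invented for illustration, and the nearest-rank method is just one simple choice of percentile definition.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s):
    """Return the median and tail latencies for a list of samples in seconds."""
    return {p: percentile(latencies_s, p) for p in (50, 90, 99)}


# Hypothetical samples in seconds: mostly fast, with a few slow outliers
# of the kind shared infrastructure might produce.
latencies = [0.8] * 90 + [1.5] * 8 + [6.0] * 2
summary = summarize(latencies)

for p, value in summary.items():
    print(f"p{p}: {value:.1f} s")
```

The median hides the slow requests entirely here, while the 99th percentile exposes them, so a single "ideal" request says little about what end users would see.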

replies(1): >>42180185 #

qeternity ◴[19 Nov 24 04:56 UTC] No.42180185[source]▶

>>42180069 #

I'm not talking about that. I and many others here have spun up 8x or more H100 clusters and run this exact model. Zero other traffic. You won't come anywhere close to this.
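A back-of-the-envelope sketch of why this is plausible, assuming that generating each token for a single request stream requires reading every weight once from GPU memory. The bandwidth figure is NVIDIA's published ~3.35 TB/s for the H100 SXM; perfect scaling across GPUs is assumed, and KV cache reads, interconnect, and compute are ignored, so this is an optimistic ceiling rather than a prediction.

```python
def max_single_stream_tps(params, bytes_per_param, gpus, bandwidth_bytes_per_s_per_gpu):
    """Upper bound on tokens/s for one stream if every weight is read once per token."""
    weight_bytes = params * bytes_per_param
    aggregate_bandwidth = gpus * bandwidth_bytes_per_s_per_gpu
    return aggregate_bandwidth / weight_bytes


PARAMS = 405e9          # Llama 3.1 405B
H100_BANDWIDTH = 3.35e12  # bytes/s per GPU (assumed: H100 SXM published figure)

# 16-bit weights (about 810 GB, which would not fit on 8 x 80 GB cards,
# so this case would need more than 8 GPUs in practice).
ceiling_16bit = max_single_stream_tps(PARAMS, 2, 8, H100_BANDWIDTH)

# 8-bit weights (about 405 GB).
ceiling_8bit = max_single_stream_tps(PARAMS, 1, 8, H100_BANDWIDTH)

print(f"16-bit ceiling: {ceiling_16bit:.0f} tokens/s")
print(f" 8-bit ceiling: {ceiling_8bit:.0f} tokens/s")
```

Under these assumptions the ceiling is in the tens of tokens per second, far below ~1k. Techniques such as speculative decoding can beat this simple bound, so treat it as an order-of-magnitude argument only.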

replies(2): >>42180357 #>>42180762 #

danpalmer ◴[19 Nov 24 05:39 UTC] No.42180357[source]▶

>>42180185 #

In that case I'm misunderstanding you. Are you saying that it's "BS" that they are reaching ~1k tokens/s? If so, you may be misunderstanding what a Cerebras machine is. Also, 8xH100 is still ~half the price of a single Cerebras machine, and that's even accounting for H100s being massively overpriced. You've got easily twice the value in a Cerebras machine; they have nearly 1m cores on a single die.

replies(1): >>42180469 #

sam_dam_gai ◴[19 Nov 24 06:09 UTC] No.42180469[source]▶

>>42180357 #

Ha ha. He probably means "at a batch size of 1", i.e. not even using some amortization tricks to get better numbers.
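A minimal sketch of the amortization in question, assuming decoding is limited by reading the weights once per step: one pass over the weights can produce a token for every request in the batch, so the aggregate number grows with batch size while each individual request is no faster. The step rate below is hypothetical, and real systems hit compute and KV cache limits that this ignores.

```python
def decode_throughput(steps_per_second, batch_size):
    """Return (tokens/s seen by one request, tokens/s summed over the batch)."""
    per_request = steps_per_second               # each request gets one token per step
    aggregate = steps_per_second * batch_size    # one weight pass serves the whole batch
    return per_request, aggregate


# Hypothetical step rate of 30 decode steps per second.
for batch_size in (1, 8, 64):
    per_request, aggregate = decode_throughput(30, batch_size)
    print(f"batch={batch_size:>2}: {per_request} tokens/s per request, {aggregate} tokens/s aggregate")
```

A headline figure quoted at batch size 1 is the per-request speed, which is the harder number to inflate.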

replies(1): >>42180534 #

danpalmer ◴[19 Nov 24 06:23 UTC] No.42180534[source]▶

>>42180469 #

Ah! That does make more sense!
