426 points benchmarkist | 10 comments
1. danpalmer ◴[] No.42179527[source]
I'm not sure if they're comparing apples to apples on the latency here. There are roughly three parts to the latency: the time to process the context/prompt (a function of prefill throughput), the time spent queueing for hardware access, and the other standard API overheads (network, etc.).
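
A rough sketch of that decomposition (my own illustration; the function and every number in it are made-up assumptions, not measurements from anyone's API):

  # Illustrative only: a crude breakdown of time-to-first-token for a hosted LLM API.
  def time_to_first_token(queue_s, prompt_tokens, prefill_tok_per_s, network_s):
      prefill_s = prompt_tokens / prefill_tok_per_s  # context/prompt processing time
      return queue_s + prefill_s + network_s

  # e.g. 0.2s of queueing + a 1,000-token prompt at 5,000 tok/s prefill + 50ms of network
  print(time_to_first_token(0.2, 1000, 5000, 0.05))  # ~0.45s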

From what I understand, several, maybe all, of the comparison services are not based on provisioned capacity, which means that the measurements include the queue time. For LLMs this can be significant. The Cerebras number on the other hand almost certainly doesn't have some unbounded amount of queue time included, as I expect they had guaranteed hardware access.

The throughput here is amazing, but getting that throughput at a good latency for end-users means over-provisioning, and it's unclear what queueing will do to this. Additionally, does that latency assume the machine already has the model loaded, or does it include loading the model if necessary? If using a fine-tuned model, does that change the latency?

I'm sure it's a clear win for batch workloads where you can keep Cerebras machines running at 100% utilisation and get 1k tokens/s constantly.
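
To make the queueing point concrete, a toy model (an M/M/1 queue, purely illustrative and not a claim about anyone's infrastructure) shows why keeping latency low and running near 100% utilisation pull in opposite directions:

  # Toy M/M/1 queue: mean wait before service, given utilisation (rho)
  # and mean service time per request. Numbers are illustrative only.
  def mean_queue_wait(rho, service_time_s):
      return rho * service_time_s / (1.0 - rho)

  for rho in (0.5, 0.8, 0.95):
      print(rho, round(mean_queue_wait(rho, 1.0), 2))  # 1.0s, 4.0s, 19.0s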

replies(1): >>42179788 #
2. qeternity ◴[] No.42179788[source]
Everyone presumes this is under ideal conditions...and it's incredible.

It's bs=1. At 1,000 t/s. Of a 405B parameter model. Wild.

replies(3): >>42179907 #>>42180069 #>>42191042 #
3. colordrops ◴[] No.42179907[source]
Right, I'd assume most LLM benchmarks are run on dedicated hardware.
4. danpalmer ◴[] No.42180069[source]
Cerebras' benchmark is most likely under ideal conditions, but I'm not sure it's possible to test public cloud APIs under ideal conditions: they're shared infrastructure, so you just don't know whether any given request was "ideal". I think you can only test these things across a significant number of requests, and even that assumes shared resource usage doesn't change much.
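
In practice the only thing I'd trust is something along these lines (a sketch; the endpoint and payload are placeholders, not any particular provider's API):

  # Sketch: hit a shared API many times and report latency percentiles,
  # rather than trusting any single "ideal" request.
  import statistics
  import time

  import requests  # assumes the requests package is installed

  def sample_latency(url, payload, n=100):
      samples = []
      for _ in range(n):
          t0 = time.monotonic()
          requests.post(url, json=payload, timeout=60)
          samples.append(time.monotonic() - t0)
      qs = statistics.quantiles(samples, n=100)
      return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}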
replies(1): >>42180185 #
5. qeternity ◴[] No.42180185{3}[source]
I'm not talking about that. I and many others here have spun up 8x or more H100 clusters and run this exact model. Zero other traffic. You won't come anywhere close to this.
replies(2): >>42180357 #>>42180762 #
6. danpalmer ◴[] No.42180357{4}[source]
In that case I'm misunderstanding you. Are you saying that it's "BS" that they are reaching ~1k tokens/s? If so, you may be misunderstanding what a Cerebras machine is. Also, 8xH100 is still ~half the price of a single Cerebras machine, and that's even accounting for H100s being massively overpriced. You've got easily twice the value in a Cerebras machine; they have nearly 1m cores on a single die.
replies(1): >>42180469 #
7. sam_dam_gai ◴[] No.42180469{5}[source]
Ha ha. He probably means "at a batch size of 1", i.e. not even using some amortization tricks to get better numbers.
replies(1): >>42180534 #
8. danpalmer ◴[] No.42180534{6}[source]
Ah! That does make more sense!
9. aurareturn ◴[] No.42180762{4}[source]

  I'm not talking about that. I and many others here have spun up 8x or more H100 clusters and run this exact model. Zero other traffic. You won't come anywhere close to this.

8x H100 can also do fine-tuning, right? Does Cerebras offer fine-tuning support?
10. ryao ◴[] No.42191042[source]
They claim it is with between 8 and 20 users:

https://x.com/draecomino/status/1858998347090325846

That said, they appear to be giving the per-user performance.
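
Back-of-envelope on my part: if the ~1,000 t/s figure is per user and 8-20 users are being served concurrently, that would imply somewhere around 8,000-20,000 t/s aggregate from the machine, though the post doesn't spell out how that concurrency is scheduled.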