Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

I'm not sure they're comparing apples to apples on the latency here. There are roughly three parts to the latency: the time taken to process the context/prompt, the time spent queueing for hardware access, and the other standard API overheads (network, etc.).

From what I understand, several, maybe all, of the comparison services are not based on provisioned capacity, which means that their measurements include queue time. For LLMs this can be significant. The Cerebras number, on the other hand, almost certainly doesn't include an unbounded amount of queue time, as I expect they had guaranteed hardware access.
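To make the apples-to-apples concern concrete, here is a rough sketch of how the tokens/s a client observes drops once queue time and other overheads are counted. The function name and the 1-second queue figure are hypothetical and chosen only for illustration; the only number taken from the announcement is 969 tokens/s.

```python
def observed_tokens_per_s(output_tokens, decode_tokens_per_s,
                          queue_s=0.0, prompt_s=0.0, overhead_s=0.0):
    """Tokens/s as seen by the client over the whole request.

    Hypothetical helper: total time is queueing + prompt processing +
    API/network overhead + time to generate the output tokens.
    """
    decode_s = output_tokens / decode_tokens_per_s
    total_s = queue_s + prompt_s + overhead_s + decode_s
    return output_tokens / total_s


# Guaranteed hardware access: no queueing, so the headline rate is what you see.
dedicated = observed_tokens_per_s(969, 969.0)

# Same hardware speed, but with an assumed 1 s wait in a shared queue.
shared = observed_tokens_per_s(969, 969.0, queue_s=1.0)

print(f"dedicated: {dedicated:.1f} tokens/s, shared queue: {shared:.1f} tokens/s")
```

Under these made-up numbers, one second of queueing halves the observed rate even though the hardware is exactly as fast, which is why it matters whether each benchmark includes queue time.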

The throughput here is amazing, but getting that throughput at a good latency for end users means over-provisioning, and it's unclear what queueing will do to it. Additionally, does that latency assume the machine already has the model loaded, or does it include loading the model if necessary? Does using a fine-tuned model change the latency?
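As a back-of-the-envelope illustration of why good latency implies over-provisioning, here is a minimal sketch using the textbook M/M/1 queue. This is a big simplification and purely an assumption on my part: a single server, Poisson arrivals, and exponentially distributed service times, none of which is known to describe Cerebras' actual serving setup. The 1-second service time corresponds to a hypothetical ~1000-token response at roughly 1k tokens/s.

```python
def mm1_mean_queue_wait(service_time_s, utilisation):
    """Mean time a request waits in queue before service, for an M/M/1 queue.

    Standard result: Wq = rho * S / (1 - rho), where S is the mean service
    time and rho is the utilisation. Assumes one server, Poisson arrivals
    and exponential service times (illustrative assumptions only).
    """
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1) for a stable queue")
    return utilisation * service_time_s / (1.0 - utilisation)


# Hypothetical request: ~1000 tokens at ~1k tokens/s, so about 1 s of service.
SERVICE_TIME_S = 1.0

for rho in (0.5, 0.8, 0.95, 0.99):
    wait_s = mm1_mean_queue_wait(SERVICE_TIME_S, rho)
    print(f"utilisation {rho:.0%}: mean queue wait ~{wait_s:.1f} s "
          f"on top of a {SERVICE_TIME_S:.1f} s request")
```

In this toy model the mean wait grows without bound as utilisation approaches 100%, so keeping queue time small relative to a ~1 s request means running the hardware well below full utilisation.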

I'm sure it's a clear win for batch workloads where you can keep Cerebras machines running at 100% utilisation and get 1k tokens/s constantly.