Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference

(cerebras.ai)

426 points benchmarkist | 3 comments | 19 Nov 24 00:15 UTC


zackangelo ◴[19 Nov 24 02:11 UTC] No.42179476[source]▶

This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #

danpalmer ◴[19 Nov 24 02:17 UTC] No.42179501[source]▶

>>42179476 #

Cerebras makes CPUs with ~1 million cores, and they're running inference on those, not on GPUs. It's an entirely different architecture, which means no network is involved. It's also possible they're serving largely from CPU caches rather than HBM.

I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.
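
To make the caches-versus-HBM point concrete, here is a rough back-of-envelope sketch, not anything Cerebras has published. It assumes single-stream decoding is memory-bound, meaning every weight is read once per generated token, so the token rate cannot exceed memory bandwidth divided by model size. The function name and the example figures (FP16 weights, roughly 3.35 TB/s of HBM bandwidth per H100) are assumptions for illustration only.

```python
def decode_tokens_per_second(param_count, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on single-stream decode rate for a memory-bound model.

    Assumes every weight is read exactly once per generated token and
    ignores KV-cache traffic, interconnect overhead, and compute time.
    """
    model_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Illustrative assumptions: a 70B-parameter model in FP16 (2 bytes/param),
# sharded across 8 GPUs with ~3.35 TB/s of HBM bandwidth each.
aggregate_hbm_bandwidth = 8 * 3.35e12
bound = decode_tokens_per_second(70e9, 2, aggregate_hbm_bandwidth)
print(f"HBM-bound ceiling: {bound:.0f} tokens/s per stream")
```

Under these assumptions the ceiling is in the low hundreds of tokens per second for a single stream, so a much higher rate implies much higher effective memory bandwidth, which is what keeping weights in on-chip memory would provide.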

replies(3): >>42179509 #>>42179599 #>>42179717 #

accrual ◴[19 Nov 24 03:02 UTC] No.42179717[source]▶

>>42179501 #

I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).

replies(4): >>42179769 #>>42179834 #>>42180050 #>>42180265 #

danpalmer ◴[19 Nov 24 04:13 UTC] No.42180050[source]▶

>>42179717 #

I believe pricing was mid 6 figures per machine. I think they're also something like 8U and water cooled. I doubt it would be possible to deploy one outside of a fairly top-tier colo facility that can support water cooling. Also, imagine learning a new CUDA, but one designed for a completely different compute model.

replies(5): >>42180442 #>>42180470 #>>42180527 #>>42181229 #>>42181357 #

bboygravity ◴[19 Nov 24 06:10 UTC] No.42180470[source]▶

>>42180050 #

That means it'll be close to affordable in 3 to 5 years if we follow the curve we've been on for the past decades.

replies(3): >>42180845 #>>42180967 #>>42181580 #

1. dgfl ◴[19 Nov 24 09:38 UTC] No.42181580[source]▶

>>42180470 #

Not really. These are wafer-scale chips, which (as far as I'm aware) were first introduced by Cerebras.

Cost reduction for cutting-edge products in the semiconductor industry has historically been driven by 1) reducing transistor size (following the Dennard scaling laws), and 2) a variety of techniques (e.g. high-k dielectrics and strained silicon, or FinFETs and now GAAFETs) to improve transistor performance further. These techniques added more manufacturing steps, but they were inexpensive enough that $/transistor still went down. In the last few years, we've had to pull off ever more expensive tricks, which has stopped the $/transistor progress. This is why the phrase "Moore's law is dead" has been circulating for a while.

In any case, higher-performance transistors mean that you can get the same functionality for less power and a smaller area, so iso-functionality chips are cheaper to build in bulk. This is especially true for older nodes, e.g. look at the absurdly low price of most microcontrollers.

On the other hand, $/wafer is mostly a volume-related metric based on less scalable technology and more conventional manufacturing (relatively speaking). Cerebras's innovation was in making a wafer-scale chip possible, which is conventionally hard due to unavoidable manufacturing defects. But crucially, such a product (by definition) cannot scale like any other circuit produced so far.

It may well drop in price in the future, especially once it becomes obsolete. But I don't expect it to ever reach consumer-level prices.

replies(1): >>42182481 #

2. adrian_b ◴[19 Nov 24 11:57 UTC] No.42182481[source]▶

>>42181580 (TP) #

Wafer-scale chips have been attempted for many decades, but none of the attempts before Cerebras resulted in a successful commercial product.

The main reason why Cerebras has succeeded and the previous attempts have failed is not technical, but the existence of market demand.

Before ML/AI training and inference, there was no application where wafer-scale chips could provide enough additional performance to make their high cost worthwhile.

replies(1): >>42190978 #

3. ryao ◴[20 Nov 24 05:06 UTC] No.42190978[source]▶

>>42182481 #

Cerebras has a patent on the technique used to etch across scribe lines. Is there any prior work that would invalidate that patent?

By the way, I am a software developer, so you will not see me challenging their patent. I am just curious.
