owenpalmer No.42180575
Given that such a boost is possible with new hardware, I wonder what the ceiling is for improving training performance via hardware as well.
replies(2): >>42180618 #>>42180710 #
bufferoverflow No.42180618
The ultimate solution would be to convert an LLM to a pure ASIC.

My guess is that would 10X the performance. But it would be a very, very expensive solution.

replies(2): >>42180716 #>>42187711 #
why_only_15 No.42180716
Why would converting a specific LLM to an ASIC help you? LLMs are like 99% matrix multiplications by work, and we already have things that amount to ASICs for matrix multiplication (e.g. the TPU) that aren't cheaper than e.g. an H100.
replies(1): >>42189318 #
mikewarot No.42189318
An ASIC could have all of the weights baked into the design, completely eliminating the von Neumann bottleneck that plagues computation.

They are inherently parallel, so you might be able to get a token per clock cycle. A billion tokens per second opens quite a few possibilities.

It could also eliminate all of the multiplications and additions of bits that are 0 from the design, shrinking each multiplier by about 50 percent in silicon area, on average.
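
A rough sketch of what I mean, in Python (the bit width and weight value are just for illustration): a shift-and-add multiplier for a baked-in constant only needs adder rows for the bits of the constant that are 1, and on average half of those bits are 0.

    # Multiplying by a *fixed* weight only needs partial products for the
    # weight bits that are 1; in hardware, the missing rows are adders that
    # never get built.

    def adder_rows(weight: int, bits: int = 8):
        """Shift amounts of the adder rows a baked-in weight actually needs."""
        return [i for i in range(bits) if (weight >> i) & 1]

    def multiply_by_const(x: int, weight: int, bits: int = 8) -> int:
        """Shift-and-add multiply using only the non-zero rows."""
        return sum(x << i for i in adder_rows(weight, bits))

    if __name__ == "__main__":
        w = 0b01010010                      # example 8-bit weight
        rows = adder_rows(w)
        print(f"weight {w:#010b}: {len(rows)}/8 adder rows kept")
        assert multiply_by_const(23, w) == 23 * w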

However, an ASIC is a bet that all the design tools work; it may take multiple rounds of fabrication to get right.

replies(1): >>42191117 #
ryao No.42191117
I doubt you could get a token per clock cycle unless the clock were very slow. In practice, even dedicated hardware for matrix-matrix multiplication does not perform the multiplication in a single clock cycle. Presumably the circuit paths would be so long that you would need a very slow clock to make it work, and there are many matrix multiplications done per token. Furthermore, the model is layered and must run through each layer in sequence. Presumably, if you implemented this you would aim for one layer per clock cycle, but even that seems like it would be quite a long circuit path.

I have some local code running llama 3 8B, and the matrix multiplications in it involve 2D matrices with dimensions ranging from 1024 to 4096. Let's just go with a nice 1024x1024 matrix and do matrix-vector multiplication, which is the minimum needed to implement llama 3. That is 1,048,576 elements. If you try to do the matrix-vector multiplication in one cycle, you will need 1,048,576 fmadd units.

I am by no means a chip designer, so I asked ChatGPT to estimate how many transistors are needed for a bf16 fmadd unit. It said 100,000 to 200,000. Let's go with 100,000 transistors per unit. Thus, to implement a single matrix multiplication this way, we would need over 100 billion transistors, and that is only a small part of the llama 3 8B model's calculations. You would probably be well into the trillions of transistors if you implemented all of it in an ASIC and did one layer per cycle (don't even think about one token per cycle). For reference, Nvidia's H100 has 80 billion transistors. The Cerebras WSE-3 has 4 trillion transistors, and I am not sure even that would be enough.
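
For what it's worth, here is the back-of-envelope arithmetic as a quick Python check (the per-fmadd transistor count is only ChatGPT's rough estimate, and this covers just one of the model's many matrices):

    # Rough arithmetic for the numbers above (all figures are estimates).
    DIM = 1024                       # one 1024x1024 weight matrix
    FMADD_UNITS = DIM * DIM          # one fmadd per element for a 1-cycle matvec
    TRANSISTORS_PER_FMADD = 100_000  # low end of the estimate above

    total = FMADD_UNITS * TRANSISTORS_PER_FMADD
    print(f"fmadd units: {FMADD_UNITS:,}")            # 1,048,576
    print(f"transistors: {total:,}")                  # ~1.05e11
    print(f"H100 budgets (80e9): {total / 80e9:.1f}") # already more than one H100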

It is a nice idea, but I do not think it is feasible with current technology. That said, I do like your out-of-the-box thinking. This might be a bit too far out of the box, but there is probably a middle ground somewhere.

replies(1): >>42191542 #
mikewarot No.42191542
You're right on the numbers. I wasn't thinking of trying to push all of that into one chip, but if you can distribute the work so that an array of chips breaks the problem apart cleanly, the numbers fall within the range of what's feasible with modern technology.

The key to this, in my view, is to give up on the idea of getting the latency of a given piece of computation as low as possible, as is typically done, and instead build small, reliable cells that are clocked, so that you never have to worry about moving data far or fast. Taking this idea to its limit yields a completely homogeneous systolic array that operates on 4 bits at a time and uses look-up tables to do everything: no dedicated switching fabric, multipliers, or anything else.
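
A toy model of what I'm describing, in Python (a sketch of the concept, not the actual hardware; the chain topology, cell count, and random tables are just for illustration): every cell is a 16-entry lookup table on a 4-bit input, and on each global clock tick every cell latches its output, so data only ever moves one cell per cycle.

    # Toy model: a 1-D chain of identical cells. Each cell maps its 4-bit input
    # to a 4-bit output through a 16-entry lookup table; all cells latch on the
    # same clock tick, so no signal has to travel more than one cell per cycle.
    import random

    N_CELLS = 8
    random.seed(0)
    luts = [[random.randrange(16) for _ in range(16)] for _ in range(N_CELLS)]
    latches = [0] * N_CELLS          # the only state in the whole array

    def clock_tick(chain_input: int) -> int:
        """One tick: every cell reads its left neighbor's latched 4-bit value
        and latches its own LUT result. Returns the last cell's output."""
        global latches
        inputs = [chain_input] + latches[:-1]     # data shifts one cell per tick
        latches = [luts[i][inputs[i]] for i in range(N_CELLS)]
        return latches[-1]

    if __name__ == "__main__":
        for t, nibble in enumerate([3, 7, 1, 15] + [0] * N_CELLS):
            print(f"tick {t:2d}: in={nibble:2d} out={clock_tick(nibble):2d}")
        # After N_CELLS ticks of fill, one result emerges every single tick.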

It's the same tradeoff von Neumann made with the ENIAC, which slowed it down by a factor of 6 (according to Wikipedia) but eliminated multiple weeks of human setup labor by loading stored programs effectively instantly.

To multiply numbers, you don't have to do all of it at the same time; you just have to pipeline the steps so that each step is operating on part of the data, and it all stays synced (which the clocking again helps with).
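
To make that concrete, here's a sketch of a pipelined shift-and-add multiplier (again just an illustration; the 8-bit width and the Python modelling are mine): each stage handles a single multiplier bit per clock, so no stage ever does much work in one cycle, yet once the pipeline is full a finished product comes out every cycle.

    # Pipelined shift-and-add multiplier: stage i handles bit i of the
    # multiplier. Latency is BITS cycles; throughput is one product per cycle.
    BITS = 8
    pipe = [None] * BITS     # pipe[i] holds work that has finished stages 0..i

    def clock(new_pair):
        """One tick: feed an (a, b) pair in (or None); get a finished a*b out (or None)."""
        global pipe
        finished = pipe[-1][0] if pipe[-1] is not None else None
        for i in range(BITS - 1, 0, -1):          # advance the pipeline
            prev = pipe[i - 1]
            if prev is None:
                pipe[i] = None
            else:
                acc, a, b = prev
                if (b >> i) & 1:                  # stage i: add a << i if bit i is set
                    acc += a << i
                pipe[i] = (acc, a, b)
        if new_pair is None:
            pipe[0] = None
        else:
            a, b = new_pair
            pipe[0] = (a if b & 1 else 0, a, b)   # stage 0 handles bit 0
        return finished

    if __name__ == "__main__":
        feed = [(3, 5), (7, 9), (23, 82)] + [None] * BITS
        for t, pair in enumerate(feed):
            out = clock(pair)
            if out is not None:
                print(f"tick {t}: product = {out}")   # 15, 63, 1886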

Since I'm working alone, right now I'm just trying to get something that other people can grok and play with.

Ideally, I'd have chips with multiple channels of LVDS interfaces running at 10 Gbps or more each to allow meshing the chips. Mostly, they'd be vast strings of D flip-flops and 16:1 multiplexers.
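
For anyone who hasn't seen the trick: a 16:1 multiplexer whose four select lines are the cell's inputs and whose sixteen data inputs are tied to a constant is a programmable 4-input logic function, and a D flip-flop on its output makes it a clocked cell. A quick Python sketch (my illustration; the XOR example is arbitrary):

    # A 16:1 mux with its 16 data inputs tied to a constant "truth table" and
    # its 4 select lines driven by the cell's inputs implements any 4-input
    # boolean function; a D flip-flop latches the result each clock.

    def mux16(truth_table: int, d: int, c: int, b: int, a: int) -> int:
        select = (d << 3) | (c << 2) | (b << 1) | a
        return (truth_table >> select) & 1

    class Cell:
        def __init__(self, truth_table: int):
            self.truth_table = truth_table
            self.q = 0                   # the flip-flop's stored bit

        def tick(self, d, c, b, a) -> int:
            self.q = mux16(self.truth_table, d, c, b, a)
            return self.q

    if __name__ == "__main__":
        XOR4 = 0b0110100110010110        # truth table for 4-input parity
        cell = Cell(XOR4)
        print(cell.tick(1, 0, 1, 1))     # parity of 1,0,1,1 -> 1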

I'm well aware of the fact that I've made arbitrary choices, and they might not be optimal for real-world hardware. I do remain steadfast in my opinion that providing a better impedance match between the computing substrate and the code that runs on it could allow multiple orders of magnitude improvement in efficiency. Not to mention the ability to run the exact same code on everything from an emulator to every successive version/size of the chip, without recompilation.

Not to mention being able to route around bad cells, actually build "walls" around code with sensitive info, etc.