My guess is that it would 10x the performance, but it would also be a very, very expensive solution.
They are inherently parallel, so you might be able to get a token per clock cycle. A billion tokens per second opens quite a few possibilities.
It could also eliminate the partial products for bits that are 0 from the design, shrinking each multiplier by about 50 percent in silicon area, on average.
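To illustrate the zero-skipping intuition (a software sketch only, not a hardware design): in a shift-and-add multiplier, each 1 bit of the multiplier contributes a shifted partial product and each 0 bit contributes nothing, and for random operands about half the bits are 0.

```python
import random

def shift_add_multiply(a: int, b: int, bits: int = 16):
    """Shift-and-add multiply that only generates partial products
    for the bits of b that are 1; zero bits cost nothing."""
    acc, adds = 0, 0
    for i in range(bits):
        if (b >> i) & 1:           # a 0 bit is skipped entirely
            acc += a << i
            adds += 1
    return acc, adds

product, adds = shift_add_multiply(7, 9)
print(product, adds)               # 63 2  (9 = 0b1001: only two partial products)

random.seed(0)
totals = [shift_add_multiply(random.getrandbits(16), random.getrandbits(16))[1]
          for _ in range(10_000)]
print(sum(totals) / len(totals))   # close to 8 of 16: about half skipped
```

Real weight distributions aren't uniform random bits, so the actual savings would differ, but the halving intuition holds for the average case.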
However, an ASIC is a speculation that all the design tools work. It may require multiple rounds to get it right.
I have some local code running llama 3 8B, and its matrix multiplications involve 2D matrices with dimensions ranging from 1024 to 4096. Let's just go with a nice 1024x1024 matrix and do matrix-vector multiplication, which is the minimum needed to implement llama 3. That is 1,048,576 elements. If you try to do the matrix-vector multiplication in 1 cycle, you will need 1,048,576 fmadd units.
I am by no means a chip designer, so I asked ChatGPT to estimate how many transistors a bf16 fmadd unit needs. It said 100,000 to 200,000. Let's go with 100,000 transistors per unit. Thus, to implement a single matrix multiplication this way, we would need over 100 billion transistors, and this is only a small part of the llama 3 8B model's computation. You would probably be well into the trillions of transistors if you implemented all of it in an ASIC and did 1 layer per cycle (don't even think of 1 token per cycle). For reference, Nvidia's H100 has 80 billion transistors. The Cerebras WSE-3 has 4 trillion transistors, and I am not sure even that would be enough.
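The arithmetic is easy to reproduce. A quick sanity check of that estimate, using the same assumed 100,000 transistors per bf16 fmadd (which is itself a guess):

```python
# Back-of-envelope check: fully unrolled 1024x1024 matrix-vector multiply,
# one fmadd unit per matrix element, done in a single cycle.
DIM = 1024
TRANSISTORS_PER_FMADD = 100_000    # rough estimate quoted above

elements = DIM * DIM
total = elements * TRANSISTORS_PER_FMADD

print(f"{elements:,} fmadd units")        # 1,048,576 fmadd units
print(f"{total / 1e9:.1f}B transistors")  # 104.9B transistors, for ONE matmul
```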
It is a nice idea, but I do not think it is feasible with current technology. That said, I do like your out-of-the-box thinking. This might be a bit too far out of the box, but there is probably a middle ground somewhere.
The key to this, in my view, is to give up on driving the latency of any given piece of computation as low as possible, as is typically done, and instead build reliable small cells that are clocked, so you don't have to worry about moving data far or fast. Taking this idea to its limit gives a completely homogeneous systolic array that operates on 4 bits at a time, using lookup tables to do everything. No dedicated switching fabric, multipliers, or anything else.
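A toy model of what one such cell might look like. The 4-bit add-with-carry table here is just an illustrative payload; in the proposed design the LUT contents would be whatever function the cell is configured for:

```python
# One homogeneous cell: a clocked element whose entire behavior is a
# lookup table over its 4-bit inputs. Example payload: 4-bit add with carry.
LUT = {(a, b): ((a + b) & 0xF, (a + b) >> 4)
       for a in range(16) for b in range(16)}

class Cell:
    def __init__(self):
        self.out = (0, 0)      # registered output (the D flip-flops)
        self._next = (0, 0)

    def compute(self, a, b):   # combinational phase: just a table lookup
        self._next = LUT[(a, b)]

    def clock(self):           # clock edge: latch the new output
        self.out = self._next

cell = Cell()
cell.compute(9, 8)
cell.clock()
print(cell.out)                # (1, 1): 9 + 8 = 17, i.e. sum 1, carry 1
```

Because every cell is identical and only its table contents differ, "routing" and "computing" become the same operation, which is what lets the array stay homogeneous.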
It's the same tradeoff von Neumann made with the ENIAC: converting it to stored-program operation slowed it down by a factor of 6 (according to Wikipedia), but replaced multiple weeks of human setup labor with programs loaded effectively instantly.
To multiply numbers, you don't have to do all of the work at the same time; you just have to pipeline the steps so that each stage is operating on part of the data, and everything stays in sync (which the clocking again helps with).
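A minimal sketch of that pipelining idea, assuming a 4-bit shift-and-add multiplier with one pipeline stage per bit: once the pipeline fills, one product completes per clock, even though any single product takes four cycles to flow through.

```python
BITS = 4  # 4-bit operands, one pipeline stage per bit of b

def clock(stages, new_input):
    """One clock edge: emit the finished item and shift work downstream.
    Each stage holds (a, b, acc); stage i adds the partial product for bit i."""
    finished = stages[-1]                  # item that completed last cycle
    shifted = [new_input] + stages[:-1]    # everything moves one stage down
    result = []
    for i, (a, b, acc) in enumerate(shifted):
        if (b >> i) & 1:                   # stage i handles bit i of b
            acc += a << i
        result.append((a, b, acc))
    return result, finished

stages = [(0, 0, 0)] * BITS
inputs = [(3, 5), (7, 9), (4, 4), (15, 15)]
outputs = []
for t in range(len(inputs) + BITS):
    a, b = inputs[t] if t < len(inputs) else (0, 0)   # flush with zeros
    stages, done = clock(stages, (a, b, 0))
    if t >= BITS:                          # first result after the fill latency
        outputs.append(done[2])

print(outputs)                             # [15, 63, 16, 225]
```

The per-product latency is 4 clocks, but the throughput is 1 product per clock; that is the same tradeoff being proposed for the array as a whole.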
Since I'm working alone, right now I'm just trying to get something that other people can grok, and play with.
Ideally, I'd have chips with multiple channels of LVDS interfaces running at 10 Gbps or more each, to allow meshing the chips. Mostly, they'd be vast strings of D flip-flops and 16:1 multiplexers.
I'm well aware that I've made arbitrary choices, and they might not be optimal for real-world hardware. I do remain steadfast in my opinion that providing a better impedance match between the computing substrate and the code that runs on it could allow multiple orders of magnitude improvement in efficiency. Not to mention the ability to run the exact same code on everything from an emulator to every successive version/size of the chip, without recompilation.
Not to mention being able to route around bad cells, actually build "walls" around code with sensitive info, etc.