
426 points | benchmarkist | 4 comments
zackangelo ◴[] No.42179476[source]
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70B implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like, at a minimum, you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
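
A rough sanity check on why ~100 tok/s is about where a dense 70B lands on 8x H100: single-stream decode is memory-bandwidth-bound, since every generated token has to stream each GPU's weight shard out of HBM. The sketch below assumes fp16 weights, even tensor-parallel sharding, and the ~3.35 TB/s HBM3 figure for an H100 SXM; it ignores KV-cache reads and inter-GPU communication entirely, so it's an upper bound, not a measurement.

    # Roofline for single-stream decode of a dense 70B model on 8x H100.
    # Assumes fp16 weights and that each generated token requires streaming
    # every GPU's weight shard from HBM once. Illustrative, not measured.
    params = 70e9                # model parameters
    bytes_per_param = 2          # fp16
    gpus = 8
    hbm_bw_per_gpu = 3.35e12     # bytes/s, approximate H100 SXM HBM3 bandwidth

    shard_bytes = params * bytes_per_param / gpus
    time_per_token = shard_bytes / hbm_bw_per_gpu  # ignores KV cache and all-reduce
    print(f"upper bound: {1 / time_per_token:.0f} tok/s per sequence")
    # ~190 tok/s; communication, kernel overhead, and KV-cache traffic eat into
    # that, so ~100 tok/s measured is in the expected ballpark.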

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #
danpalmer ◴[] No.42179501[source]
Cerebras makes wafer-scale processors with ~1 million cores, and they're running inference on those, not on GPUs. It's an entirely different architecture, which means no inter-node network is involved. It's also possible they're serving much of this from on-chip memory rather than HBM.
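
For a sense of why on-chip memory changes the picture, here's the same roofline arithmetic with SRAM bandwidth in place of HBM. This assumes Cerebras's published WSE-3 figures (roughly 44 GB of on-chip SRAM and ~21 PB/s aggregate bandwidth) and fp16 weights; a ~140 GB model wouldn't fit on one wafer, so this only illustrates the bound, not how they actually serve the model.

    # Same roofline arithmetic, but with on-chip SRAM bandwidth instead of HBM.
    # Assumes the published WSE-3 figure of ~21 PB/s aggregate on-chip bandwidth
    # and fp16 weights. Only meant to show the bound, not Cerebras's actual setup.
    params = 70e9
    bytes_per_param = 2
    sram_bw = 21e15              # bytes/s, aggregate on-chip bandwidth

    time_per_token = params * bytes_per_param / sram_bw
    print(f"upper bound: {1 / time_per_token:.0f} tok/s per sequence")
    # On the order of 150,000 tok/s: the memory wall that caps GPU decode speed
    # largely disappears once weights are served from on-chip memory.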

I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.

replies(3): >>42179509 #>>42179599 #>>42179717 #
accrual ◴[] No.42179717[source]
I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).
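For scale, here's what the weights alone of a 70B model take at a few common precisions (the $500 card is hypothetical, and KV cache and activations would come on top of this):

    # Weight memory for a 70B-parameter model at common precisions.
    params = 70e9
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        gib = params * bits / 8 / 2**30
        print(f"{name}: {gib:.0f} GiB of weights")
    # fp16 ~130 GiB, int8 ~65 GiB, int4 ~33 GiB -- even 4-bit weights exceed the
    # memory of today's consumer GPUs, which is why such a card would be news.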
replies(4): >>42179769 #>>42179834 #>>42180050 #>>42180265 #
danpalmer ◴[] No.42180050[source]
I believe pricing was mid six figures per machine. They're also something like 8U and water-cooled, so I doubt it would be possible to deploy one outside a fairly top-tier colo facility that can support water cooling. Also, imagine learning a new CUDA, but one designed for a completely different compute model.
replies(5): >>42180442 #>>42180470 #>>42180527 #>>42181229 #>>42181357 #
bboygravity ◴[] No.42180470[source]
That means it'll be close to affordable in 3 to 5 years, if we follow the curve we've been on for the past few decades.
replies(3): >>42180845 #>>42180967 #>>42181580 #
1. schoen ◴[] No.42180845[source]
How have power and cooling been doing with respect to chip improvements? Have power requirements per operation been coming down rapidly, as other features have improved?

My recollection from PC CPUs is that we've gotten many more operations per second, and many more operations per second per dollar, but that the power and corresponding cooling requirements for the CPUs have tended to go up as well. I don't really know what power per operation has looked like there. (I guess it's clearly improved, though, because it seems like the power consumption of a desktop PC has only increased by a single order of magnitude, while the computational capacity has increased by more than that.)

A reason that I wonder about this in this context is that people are saying that the power and cooling requirements for these devices are currently enormous (by individual or hobbyist standards, not by data center standards!). If we imagine a Moore's Law-style improvement where the hardware itself becomes 1/10 or 1/100 of its current price, would we expect the overall power consumption to be similarly reduced, or to remain closer to its current levels?
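
The quantity that question boils down to is energy per operation, i.e. power divided by throughput. A toy comparison with made-up but plausible numbers, just to illustrate how a chip's total power can creep up while its power per operation falls sharply:

    # Energy per operation = power / throughput. Hypothetical desktop CPUs with
    # illustrative numbers, not real measurements.
    def joules_per_op(watts, ops_per_sec):
        return watts / ops_per_sec

    old = joules_per_op(watts=90, ops_per_sec=50e9)     # older chip: 90 W, 50 GFLOP/s
    new = joules_per_op(watts=150, ops_per_sec=1.5e12)  # newer chip: 150 W, 1.5 TFLOP/s

    print(f"old: {old:.1e} J/op, new: {new:.1e} J/op, improvement: {old / new:.0f}x")
    # Total power rose ~1.7x while throughput rose 30x, so energy per operation
    # fell ~18x. Whether a cheaper future wafer-scale part also draws less total
    # power depends on whether the price drop comes from efficiency gains or scale.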

replies(1): >>42180977 #
2. chaxor ◴[] No.42180977[source]
Moore's law in the consumer space seems to be pretty much asymptoting now, as indicated by Apple's amazing MacBooks with an astounding 8GB of RAM. Data center compute is arguable, as it tends to be catered to some niche, which makes it confusing (Cerebras as an example vs GPU data centers vs more standard HPC). Also, clusters and even GPUs don't really fit into Moore's law as originally framed.
replies(1): >>42181319 #
3. saagarjha ◴[] No.42181319[source]
Apple doesn’t sell those anymore.
replies(1): >>42184690 #
4. chaxor ◴[] No.42184690{3}[source]
Aw man, are they selling only 4GB ones now?

More seriously, even 16GB was essentially the 'norm' in consumer PCs about 15 years ago.