426 points benchmarkist | 79 comments
1. zackangelo ◴[] No.42179476[source]
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #
2. yalok ◴[] No.42179489[source]
How much memory do you need to run Llama 3 70B at fp8 - can it potentially fit on 1 H100 GPU with 96GB RAM?

In other words, if you wanted to run 8 separate 70B models on your cluster, each of which would fit into 1 GPU, how much larger could your overall token output be compared to parallelizing 1 model across all 8 GPUs and having things slowed down a bit due to NVLink?

replies(2): >>42179533 #>>42180112 #
3. modeless ◴[] No.42179493[source]
Cerebras is a chip company. They are not using GPUs. Their chip uses wafer scale integration which means it's the physical size of a whole wafer, dozens of GPUs in one.

They have limited memory on chip (all SRAM) and it's not clear how much HBM bandwidth they have per wafer. It's a completely different optimization problem than running on GPU clusters.

replies(2): >>42180735 #>>42190988 #
4. danpalmer ◴[] No.42179501[source]
Cerebras makes CPUs with ~1 million cores, and they're inferring on that, not on GPUs. It's an entirely different architecture, which means no network involved. It's possible they're doing this significantly from CPU caches rather than HBM as well.

I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.

replies(3): >>42179509 #>>42179599 #>>42179717 #
5. parsimo2010 ◴[] No.42179503[source]
They are doing it with custom silicon that has several times more area than 8x H100s. I’m sure they are doing some sort of optimization at execution/runtime, but the primary difference is the sheer transistor count.

https://cerebras.ai/product-chip/

replies(1): >>42179580 #
6. zackangelo ◴[] No.42179509[source]
Ah, makes a lot more sense now.
replies(1): >>42179747 #
7. zackangelo ◴[] No.42179533[source]
It’s been a minute so my memory might be off, but I think when I ran 70b at fp16 it just barely fit on a 2x A100 80GB cluster and then quickly OOMed as the context/kv cache grew.

So if I had to guess a 96GB H100 could probably run it at fp8 as long as you didn’t need a big context window. If you’re doing speculative decoding it probably won’t fit because you also need weights and kv cache for the draft model.
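
A rough sketch of the memory math behind this, assuming the published Llama 3.1 70B shapes (80 layers, 8 KV heads, head dim 128); treat it as an estimate, not a measurement:

    # Back-of-envelope memory footprint for Llama 3.1 70B: weights + KV cache
    def weight_gb(params_b, bytes_per_param):
        return params_b * bytes_per_param  # billions of params * bytes each ~= GB

    def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
        # 2x for K and V, fp16 cache
        return tokens * layers * kv_heads * head_dim * 2 * bytes_per_elem / 1e9

    print(weight_gb(70, 2))        # ~140 GB at fp16 -> just barely fits in 2x 80 GB
    print(weight_gb(70, 1))        # ~70 GB at fp8   -> fits a ~96 GB H100 with headroom
    print(kv_cache_gb(32_000))     # ~10.5 GB of fp16 KV cache at a 32k context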

8. coder543 ◴[] No.42179580[source]
To be specific, a single WSE-3 has the same die area as about 57 H100s. It's a big chip.
replies(4): >>42179681 #>>42179826 #>>42180319 #>>42181360 #
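A quick sanity check on that ratio, using the commonly cited die sizes (~46,225 mm² for the WSE-3, ~814 mm² for the H100 die):

    wse3_mm2 = 46_225   # 21.5 cm x 21.5 cm wafer-scale die
    h100_mm2 = 814      # GH100 die
    print(wse3_mm2 / h100_mm2)   # ~56.8, i.e. roughly 57 H100-sized dies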
9. swyx ◴[] No.42179599[source]
> TechTechPotato YouTube videos on Cerebras

https://www.youtube.com/@TechTechPotato/search?query=cerebra... for anyone also looking. there are quite a lot of them.

10. ◴[] No.42179681{3}[source]
11. accrual ◴[] No.42179717[source]
I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).
replies(4): >>42179769 #>>42179834 #>>42180050 #>>42180265 #
12. StrangeDoctor ◴[] No.42179747{3}[source]
also the WSE-3 pulls 15kW. https://www.eetimes.com/cerebras-third-gen-wafer-scale-chip-...

but 8x H100 are ~2.6-5.2kW (I get conflicting info, I think based on PCIe vs SXM), so anywhere between roughly even and up to 2x as efficient.

13. boroboro4 ◴[] No.42179754[source]
There are two big tricks: their chips are enormous, and they use SRAM as their memory, which is vastly faster than the HBM used by GPUs. In fact, this is the main reason it’s so fast. Groq is fast for the same reason.
14. chessgecko ◴[] No.42179769{3}[source]
"One day" is doing some heavy, heavy lifting here; we’re currently off by ~3-4 orders of magnitude…
replies(2): >>42179793 #>>42179975 #
15. accrual ◴[] No.42179793{4}[source]
Thank you for the reality check! :)
replies(1): >>42179931 #
16. mmaunder ◴[] No.42179794[source]
Nah. Try vLLM and 405B FP8 on that hardware. And make sure you’re benchmarking with some concurrency for max TPS.
replies(1): >>42189943 #
17. cma ◴[] No.42179826{3}[source]
It is worth splitting out the stacked memory silicon layers on both too (if Cerebras is set up with external DRAM memory). HBM is over 10 layers now so the die area is a good bit more than the chip area, but different process nodes are involved.
18. killingtime74 ◴[] No.42179834{3}[source]
Maybe not $500, but $500,000
19. thomashop ◴[] No.42179931{5}[source]
We have moved 2 orders of magnitude in the last year. Not that unreasonable
20. grahamj ◴[] No.42179975{4}[source]
So 1000-10000 days? ;)
replies(1): >>42182198 #
21. simonw ◴[] No.42180035[source]
They have a chip the size of a dinner plate. Take a look at the pictures: https://cerebras.ai/product-chip/
replies(3): >>42180116 #>>42180150 #>>42180490 #
22. danpalmer ◴[] No.42180050{3}[source]
I believe pricing was mid 6 figures per machine. They're also something like 8U and water cooled, I believe. I doubt it would be possible to deploy one outside of a fairly top-tier colo facility with the ability to support water cooling. Also, imagine having to learn a new CUDA, but one designed for a completely different compute model.
replies(5): >>42180442 #>>42180470 #>>42180527 #>>42181229 #>>42181357 #
23. qingcharles ◴[] No.42180112[source]
It should work, I believe. And anything that doesn't fit you can leave in system RAM.

Looks like an H100 runs about $30K online for one. Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?

replies(1): >>42180911 #
24. pram ◴[] No.42180116[source]
I'd love to see the heatsink for this lol
replies(1): >>42180317 #
25. hendler ◴[] No.42180144[source]
Check out BaseTen for performant use of GPUs
26. Aeolun ◴[] No.42180150[source]
21 petabytes per second. Can push the whole internet over that chip xD
replies(2): >>42180630 #>>42180744 #
27. visarga ◴[] No.42180265{3}[source]
You still have to pay for the memory. The Cerebras chip is fast because they use 700x more SRAM than, say, A100 GPUs. Loading the whole model from memory every time you compute one token is the expensive bit.
28. futureshock ◴[] No.42180317{3}[source]
They call it the “engine block”!

https://www.servethehome.com/a-cerebras-cs-2-engine-block-ba...

29. tomrod ◴[] No.42180319{3}[source]
Amazing!
30. initplus ◴[] No.42180442{4}[source]
Yeah you can see the cooling requirements by looking at their product images. https://cerebras.ai/wp-content/uploads/2021/04/Cerebras_Prod...

Thing is nearly all cooling. And look at the diameter on the water cooling pipes. Airflow guides on the fans are solid steel. Apparently the chip itself measures 21.5cm on a side. Insane.

31. bboygravity ◴[] No.42180470{4}[source]
That means it'll be close to affordable in 3 to 5 years if we follow the curve we've been on for the past decades.
replies(3): >>42180845 #>>42180967 #>>42181580 #
32. ekianjo ◴[] No.42180490[source]
what kind of yield do they get on that size?
replies(2): >>42180600 #>>42180601 #
33. trsohmers ◴[] No.42180527{4}[source]
Based on their S1 filing and public statements, the average cost per WSE system for their largest customer (~90% of their total revenue) is ~$1.36M, and I’ve heard “retail” pricing of $2.5M per system. They are also 15U and, due to power and additional support equipment, take up an entire rack.

The other thing people don’t seem to be getting in this thread is that just holding the weights for 405B at FP16 requires 19 of their systems, since it is SRAM only… rounding up to 20 to account for program code + KV cache for the user context would mean 20 systems/racks, so well over $20M. The full rack (including support equipment) also consumes 23kW, so we are talking nearly half a megawatt and ~$30M for them to be getting this performance on Llama 405B.

replies(5): >>42180544 #>>42181290 #>>42181897 #>>42181931 #>>42190965 #
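A sketch of the arithmetic behind those figures, taking the ~44 GB of on-wafer SRAM per WSE-3 as given and using the costs quoted above:

    import math

    weights_gb = 405 * 2                    # 405B params at FP16 ~= 810 GB
    sram_per_system_gb = 44                 # on-wafer SRAM per WSE-3
    systems = math.ceil(weights_gb / sram_per_system_gb)   # 19 just for weights
    racks = systems + 1                     # ~20 with code + KV cache headroom

    print(racks * 23)                       # ~460 kW, i.e. nearly half a megawatt
    print(racks * 1.36)                     # ~$27M at the ~$1.36M per-system cost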
34. danpalmer ◴[] No.42180544{5}[source]
Thank you, far better answer than mine! Those are indeed wild numbers, although interestingly "only" 23kW; I'd expect the same level of compute in GPUs to be quite a lot more than that, or at least higher power density.
replies(1): >>42180615 #
35. mikewarot ◴[] No.42180569[source]
Imagine if you could take Llama 3.1 405B and break it down into a tree of logic gates, optimizing out things like multiplies by 0 in one of the bits, etc... then load it into a massive FPGA-like chip that had no von Neumann bottleneck and was just pure compute without memory access latency, with a conservative 1 GHz clock rate.

Such a system would be limited by the latency across the reported 126 layers' worth of math involved before it could generate the next token, which might be as much as 100 µs. So it would be 10x faster, but you could have thousands of other independent streams pipelined through in parallel, because you'd get a token per clock cycle out the end.

In summary, 1 Gigatoken/second, divided into 100,000 separate users each getting 10k tokens/second.

This is the future I want to build.

replies(2): >>42181807 #>>42181968 #
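A sketch of the throughput arithmetic behind that claim, using the numbers assumed in the comment (1 GHz clock, ~100 µs of pipeline latency per token, one token emitted per clock once the pipeline is full):

    clock_hz = 1e9                 # assumed clock rate
    latency_s = 100e-6             # end-to-end latency through all the layers

    tokens_in_flight = int(clock_hz * latency_s)      # ~100,000 pipelined streams
    aggregate_tok_s = clock_hz                        # one token per clock at the output
    per_stream_tok_s = aggregate_tok_s / tokens_in_flight

    print(tokens_in_flight, aggregate_tok_s, per_stream_tok_s)
    # ~100,000 streams, ~1e9 tok/s aggregate, ~10,000 tok/s per stream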
36. bufferoverflow ◴[] No.42180600{3}[source]
It's near 100%. Discussed here:

https://youtu.be/f4Dly8I8lMY?t=95

37. petra ◴[] No.42180601{3}[source]
Part of their technology is managing/bypassing defects.
38. YetAnotherNick ◴[] No.42180615{6}[source]
You get ~400 TFLOP/s from an H100 at 350W. You need (2 * tokens/s * param count) FLOP/s. For 405B at 969 tok/s you just need 784 TFLOP/s, which is just 2 H100s.

The limiting factor with GPUs for inference is memory bandwidth. For 969 tok/s in int8, you need 392 TB/s of memory bandwidth, or about 200 H100s.

replies(3): >>42182416 #>>42182750 #>>42190957 #
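A sketch of both bounds with the approximations used above (2 FLOPs per parameter per token for compute; one full pass over the weights per token at batch size 1 for bandwidth):

    params = 405e9
    tok_s = 969

    flops_needed = 2 * params * tok_s       # ~7.8e14 FLOP/s
    h100_flops = 4e14                       # the ~400 TFLOP/s per-H100 figure used above
    print(flops_needed / h100_flops)        # ~2 H100s of compute

    bw_needed = params * 1 * tok_s          # int8: one byte per weight -> ~392 TB/s
    h100_bw = 2e12                          # ~2 TB/s per H100 (PCIe HBM figure)
    print(bw_needed / h100_bw)              # ~196 H100s of bandwidth at batch size 1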
39. why_only_15 ◴[] No.42180630{3}[source]
The number for that is, I believe, 1 terabit/s or 125GB/s -- 21 petabytes per second is the speed from the SRAM (~registers) to the cores (~ALUs) for the whole chip. It's not especially impressive as SRAM speeds go. The impressive thing is that they have an enormous amount of SRAM.
40. why_only_15 ◴[] No.42180735[source]
they have about 125GB/s of off-chip bandwidth
replies(1): >>42180792 #
41. KeplerBoy ◴[] No.42180744{3}[source]
That's their on-chip cache bandwidth. Usually that stuff isn't even measured by bandwidth but by latency.
42. saagarjha ◴[] No.42180792{3}[source]
Do they just not do HBM at all or
replies(1): >>42182712 #
43. schoen ◴[] No.42180845{5}[source]
How have power and cooling been doing with respect to chip improvements? Have power requirements per operation been coming down rapidly, as other features have improved?

My recollection from PC CPUs is that we've gotten many more operations per second, and many more operations per second per dollar, but that the power and corresponding cooling requirements for the CPUs have tended to go up as well. I don't really know what power per operation has looked like there. (I guess it's clearly improved, though, because it seems like the power consumption of a desktop PC has only increased by a single order of magnitude, while the computational capacity has increased by more than that.)

A reason that I wonder about this in this context is that people are saying that the power and cooling requirements for these devices are currently enormous (by individual or hobbyist standards, not by data center standards!). If we imagine a Moore's Law-style improvement where the hardware itself becomes 1/10 or 1/100 of its current price, would we expect the overall power consumption to be similarly reduced, or to remain closer to its current levels?

replies(1): >>42180977 #
44. joha4270 ◴[] No.42180911{3}[source]
> Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?

Cooling might be a challenge. The H100 has a heatsink designed to make use of the case fans. So you need a fairly high airflow through a part which is itself passive.

On a server this isn't too big a problem: you have fans in one end and GPUs blocking the exit on the other end, but in a desktop you probably need to get creative with cardboard/3d printed shrouds to force enough air through it.

45. dheera ◴[] No.42180967{5}[source]
It will also mean 405B models will be uninteresting in 3 to 5 years if we follow the curve we've been on for the past decades.
replies(1): >>42181405 #
46. chaxor ◴[] No.42180977{6}[source]
Moore's law in the consumer space seems to be pretty much asymptoting now, as indicated by Apple's amazing MacBooks with an astounding 8GB of RAM. Data center compute is arguable, as it tends to be catered to some niche, making it confusing (Cerebras as an example vs GPU datacenters vs more standard HPC). Also, clusters and even GPUs don't really fit into Moore's law as originally framed.
replies(1): >>42181319 #
47. szundi ◴[] No.42181229{4}[source]
Parent wishes for 70B, not 405B, though
48. meowface ◴[] No.42181290{5}[source]
Thank you for the breakdown. Bit of an emotional journey.

"$500 in the future...? Oh, $30 million now, so that might be a while..."

replies(1): >>42181646 #
49. saagarjha ◴[] No.42181319{7}[source]
Apple doesn’t sell those anymore.
replies(1): >>42184690 #
50. wkat4242 ◴[] No.42181357{4}[source]
Yeah but what is in a 4090 is also comparable to a whole rack of servers a decade ago. The tech will get smaller.
51. ◴[] No.42181360{3}[source]
52. int_19h ◴[] No.42181405{6}[source]
I don't think they'll be uninteresting. They won't be cutting-edge anymore, sure, but much of the more practical applications of AI that we see today don't run on today's cutting-edge models, either. We're always going to have a certain compute budget, and if a smaller model does the job fine, why wouldn't you use it, and use the rest for something else (or use all of it to run the smaller model faster).
53. dgfl ◴[] No.42181580{5}[source]
Not really. These are wafer-scale chips, which (as far as I'm aware) were first introduced by Cerebras.

Cost reduction for cutting-edge products in the semiconductor industry has historically been driven by 1) reducing transistor size (by following the Dennard scaling laws), and 2) a variety of techniques (e.g. high-k dielectrics and strained silicon, or FinFETs and now GAAFETs) to improve transistor performance further. These techniques added more steps during manufacturing, but they were inexpensive enough that they still allowed $/transistor to fall. In the last few years, we've had to pull off ever more expensive tricks, which stopped the $/transistor progress. This is why the phrase "Moore's law is dead" has been circulating for a while.

In any case, higher-performance transistors mean that you can get the same functionality for less power and a smaller area, meaning that iso-functionality chips are cheaper to build in bulk. This is especially true for older nodes, e.g. look at the absurdly low price of most microcontrollers.

On the other hand, $/wafer is mostly a volume-related metric based on less scalable technology and more conventional manufacturing (relatively speaking). Cerebras's innovation was in making a wafer-scale chip possible, which is conventionally hard due to unavoidable manufacturing defects. But crucially, such a product (by definition) cannot scale like any other circuit produced so far.

It may for sure drop in price in the future, especially once it gets obsolete. But I don't expect it to ever reach consumer level prices.

replies(1): >>42182481 #
54. jamalaramala ◴[] No.42181646{6}[source]
It took 30 years for computers to go from entire rooms to desktops, and another 30 years to go from desktops to our pockets.

I don't know if we can extrapolate, but I can imagine AI inference on our desktops for $500 in a few years...

replies(2): >>42182572 #>>42197460 #
55. seangrogg ◴[] No.42181807[source]
I'm actively trying to learn how to do exactly this, though I'm just getting started with FPGAs now so probably a very long range goal.
replies(1): >>42191030 #
56. sumedh ◴[] No.42181897{5}[source]
> Based on their S1 filing and public statements

Is it a good stock to buy :)

57. petra ◴[] No.42181931{5}[source]
Given those details, they don't seem much better on cost per token than Nvidia-based systems.
58. jacobgorm ◴[] No.42181968[source]
See Convolutional Differentiable Logic Gate Networks https://arxiv.org/abs/2411.04732 , which is a small step in that direction.
59. Yizahi ◴[] No.42182198{5}[source]
In a few thousand days (c) St. Altman
replies(1): >>42194265 #
60. latchkey ◴[] No.42182416{7}[source]
Memory bandwidth and memory size. Along with power/cooling density.

Hence why you see AMD's MI325x coming out with 256GB of HBM3e, but with the same FLOPs as a 300x. 6TB/s too, which outperforms H200s by a lot.

You can see the direction AMD is going with this...

https://www.amd.com/en/products/accelerators/instinct/mi300/...

61. adrian_b ◴[] No.42182481{6}[source]
Wafer-scale chips have been attempted for many decades, but none of the previous attempts before Cerebras has resulted in a successful commercial product.

The main reason why Cerebras has succeeded and the previous attempts have failed is not technical, but the existence of market demand.

Before ML/AI training and inference, there has been no application where wafer-scale chips could provide enough additional performance to make their high cost worthwhile.

replies(1): >>42190978 #
62. stefs ◴[] No.42182572{7}[source]
well, we can run AI inference on our desktops for $500 today, just with smaller models and far slower.
replies(1): >>42197536 #
63. why_only_15 ◴[] No.42182712{4}[source]
I'm not too up to date, but as I recall there are a lot of weirdnesses because of how big their chip is (e.g. thermal expansion being a problem). I believe they have a single giant line in the middle of the chip for this reason. Maybe this makes HBM etc. hard? Certainly their chip would be more appealing if they cut down the # of cores by 10x, added matrix units and added HBM, but it looks like they're not going to go this way.
64. Const-me ◴[] No.42182750{7}[source]
> For 969 tok/s in int8, you need 392 TB/s memory bandwidth

I think that math is only valid for batch size = 1. When these 969 tokens/second come from multiple sessions of the same batch, loaded model tensor elements are reused to compute many tokens for the entire batch. With large enough batches, you can even saturate compute throughput of the GPU instead of bottlenecking on memory bandwidth.

replies(1): >>42190945 #
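A sketch of that trade-off as a simple roofline-style estimate (per decode step you read the weights once and do 2 * params FLOPs per sequence in the batch; the per-GPU numbers are the rough H100 figures used earlier in the thread, and KV-cache traffic and inter-GPU communication are ignored):

    def decode_tok_s(batch, params=405e9, bytes_per_param=1,   # int8 weights
                     mem_bw=2e12, flops=4e14):
        t_mem = params * bytes_per_param / mem_bw   # one pass over the weights
        t_compute = 2 * params * batch / flops      # matmul work grows with batch size
        return batch / max(t_mem, t_compute)        # tokens/s across the whole batch

    for b in (1, 8, 64, 256):
        print(b, round(decode_tok_s(b)))
    # batch 1 is purely bandwidth-bound (~5 tok/s on one GPU's bandwidth);
    # bigger batches reuse each weight load and approach the compute roof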
65. chaxor ◴[] No.42184690{8}[source]
Aw man, are they selling only 4GB ones now?

More seriously, even 16GB was essentially the 'norm' in consumer PCs about 15 years ago.

66. zackangelo ◴[] No.42189943[source]
Related recent discussion on twitter: https://x.com/Teknium1/status/1858987850739728635

Looks like other folks get 80 tok/s with max batch size; that's surprising to me, but vLLM is definitely more optimized than my implementation.

67. ryao ◴[] No.42190945{8}[source]
They claim to obtain that number with 8 to 20 concurrent users:

https://x.com/draecomino/status/1858998347090325846

68. ryao ◴[] No.42190957{7}[source]
Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.
replies(1): >>42193142 #
69. ryao ◴[] No.42190965{5}[source]
From what I have read, it is a maximum of 23 kW per chip and each chip goes into a 16U. That said, you would need at least 460 kW power to run the setup you described.

As for retail pricing being $2.5 million, I read $2 million in a news article earlier this year. $2.5 million makes it sound even worse.

70. ryao ◴[] No.42190978{7}[source]
Cerebras has a patent on the technique used to etch across scribe lines. Is there any prior work that would invalidate that patent?

By the way, I am a software developer, so you will not see me challenging their patent. I am just curious.

71. ryao ◴[] No.42190988[source]
They do not use HBM. Off-chip memory is accessible at 150GB/sec.
72. ryao ◴[] No.42191030{3}[source]
There is not enough memory attached to FPGAs to do this. Some FPGAs come with 16GB of HBM attached, but that is not enough, and the bandwidth provided is not as high as it is on GPUs. You would need to work out how to connect enough memory chips simultaneously to get high bandwidth and enough capacity in order for an FPGA solution to be performance competitive with a GPU solution.
replies(1): >>42191545 #
73. mikewarot ◴[] No.42191545{4}[source]
Instead of separate memory/compute, I propose to fuse them.
74. YetAnotherNick ◴[] No.42193142{8}[source]
8 H100s could achieve a lot more than 1000 tokens/sec.

> Memory bandwidth for inferencing does not scale with the number of GPUs

It does

replies(1): >>42197442 #
75. grahamj ◴[] No.42194265{6}[source]
lol I almost said that too
76. ryao ◴[] No.42197442{9}[source]
This is on llama 3.1 405B.

Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.

replies(1): >>42200967 #
77. ◴[] No.42197460{7}[source]
78. ryao ◴[] No.42197536{8}[source]
There is no need to use smaller models. You can run the biggest models, such as Llama 3.1 405B, on a fairly low-end desktop today:

https://github.com/lyogavin/airllm

However, it will be far slower as you said.

79. YetAnotherNick ◴[] No.42200967{10}[source]
These two sources [1][2] show 1500-2500 tokens per second on 8x H100.

[1]: https://lmsys.org/blog/2024-07-25-sglang-llama3/?ref=blog.ru...

[2]: https://www.snowflake.com/engineering-blog/optimize-llms-wit...