I'm also in the quad-3090 build idea stage and bought 2 with the intention of going up to 4 eventually for this purpose. However, since I bought my first 2 a few months back (at about 800 euro each!), the eBay prices have actually gone up... a lot. I purchased a specific model that I thought would be plentiful, since I had found a seller with a lot of them from OEM pulls who was taking good offers, and suddenly they all got sold. I feel like we are entering another GPU gap like 2020-2021.
Based on the performance of Llama 2 70B, I think the 96GB of VRAM plus the CUDA core count and bandwidth of four 3090s will hit a golden zone for price-performance in a deep learning rig that can do a bit of finetuning on top of just inference.
Unless A6000 prices (or A100 prices) start plummeting.
My only holdout is the thought that maybe Nvidia releases a 48GB Titan-type card at a less-than-A6000 price sometime soon, which would shake things up.
1) You need 2 power supplies to run at peak power, and most US outlets can't handle that
2) If you're running on 1 power supply, you need to limit the power consumption and clock speeds of the cards to prevent the PSU fuse from popping
3) To get all 4 cards running at once, most motherboard BIOSes need an adjustment
4) Finding a motherboard that can handle enough PCIe lanes can be a challenge
Ultimately I think I get 95% or higher performance on one PSU with appropriate power limiting and 48 lanes (vs a full 64)
I went with the cheapest Threadripper I could find, but after testing, the performance hit for going from 16 to 8 PCIe lanes per card was actually not that large, and I would be okay with 8 lanes for each card.
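On the power limiting: a minimal sketch of how I'd cap all the cards at once, assuming the nvidia-ml-py (pynvml) package. It's the same effect as nvidia-smi -pl 300, needs root, and 300W is just the example figure, not a recommendation:

    import pynvml

    TARGET_W = 300  # example cap; pick whatever your PSU budget works out to
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)  # mW
            limit = min(max(TARGET_W * 1000, lo), hi)  # clamp to what the card allows
            pynvml.nvmlDeviceSetPowerManagementLimit(h, limit)
            print(f"GPU {i}: power limit set to {limit / 1000:.0f} W")
    finally:
        pynvml.nvmlShutdown()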
1) I live in the EU with 240V mains and can pull 3600W from a single circuit and be OK. I'd also likely limit the cards a bit for power efficiency, as I already do that with a single 3090; running at 300W vs the 350W max TDP makes such a small difference in performance that I don't notice it. This isn't gaming.
2) I still may run dual PSUs, but with the CPU (below) and the cards at 1200W I'd max out at around 1600W; add overhead, and there are some 2000W PSUs that could do it (rough arithmetic in the sketch after this list). Also, being on 240V mains makes PSU selection easier, as I've seen many units that can source more wattage on 230/240V than on 115/120V.
3) I'm leaning towards a Supermicro H12SSL-i motherboard, which I've seen people run quad A5000s on without any hassle. It has proper PCIe x16 slots spaced 2 slots apart, and I'd likely be watercooling all 4 cards together at 2-slot spacing or installing MSI Turbo heatsinks with blower fans, server-rack style, for 2-slot cooling.
4) See 3. An AMD EPYC 7713 is currently the choice, with 3200MHz DDR4 giving the memory bandwidth I need for a specific CPU workload as well.
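Rough single-PSU arithmetic for point 2, using the figures quoted above (the CPU and overhead numbers are guesses off the 7713's 225W TDP, not measurements):

    cards_w = 4 * 300    # four 3090s power-limited to ~300 W each
    cpu_w   = 280        # EPYC 7713: 225 W TDP plus platform headroom (assumed)
    rest_w  = 100        # fans, pumps, drives, RAM (assumed)
    load_w  = cards_w + cpu_w + rest_w   # ~1580 W, i.e. "around 1600 W"
    psu_w   = load_w / 0.9               # keep the PSU below ~90% of its rating
    print(load_w, round(psu_w))          # 1580, 1756 -> a 2000 W unit has margin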
I'm currently trying to figure out whether it's worth dealing with the massive import duty I'd get hit with if I bought the H12SSL-i, RAM, and the EPYC from one of the many Chinese eBay resellers that sell these things at basically wholesale. I get hit with 20% VAT plus something like another 20% duty on imported computer parts. It's close to the point where it would be cheaper for me to book a round-trip flight to Taiwan and take the parts home in my luggage. People in the USA don't realize how relatively little they pay for computing equipment.
So, minus the potential tax implications this can all be done for about 8000 EUR.
I do computational fluid dynamics work and am working on GPU-assisted solvers, which is the reason for the 64-core EPYC: parts of that workflow still need CPU threads to pump numbers to the GPGPUs. This is also why I don't just run things in the cloud, as so many people suggest when talking about spending 8K on a rig vs $2/hr per A100 at Lambda Labs; my specific work needs to stay on-prem.
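As a toy illustration of that CPU-feeds-GPU pattern (not my actual solver; it assumes CuPy, assemble_chunk is a made-up stand-in for the CPU-side assembly work, and true copy/compute overlap would also need pinned host buffers):

    import numpy as np
    import cupy as cp

    streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]

    def assemble_chunk(i):
        # stand-in for the CPU-heavy assembly the EPYC cores would be doing
        return (np.random.rand(1_000_000) * i).astype(np.float32)

    partials = []
    for i in range(8):
        host = assemble_chunk(i)        # CPU threads produce the next block
        with streams[i % 2]:            # alternate streams so copies/kernels can queue
            dev = cp.asarray(host)      # host-to-device copy on this stream
            partials.append(dev.sum())  # GPU work enqueued behind the copy
    for s in streams:
        s.synchronize()
    total = float(sum(p.get() for p in partials))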
All told, it's less than two A6000s at current prices, and it's a whole machine to boot.
One suggestion: also limit clock speeds/voltages. There are transients when a 3090 loads models that can exceed double its typical draw, and 4 starting at once can draw substantially more than I expected.
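If it helps, a sketch of locking core clocks via NVML (the call behind nvidia-smi -lgc; it assumes pynvml, needs root, and the 210-1500 MHz range is just an illustrative pick):

    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            pynvml.nvmlDeviceSetGpuLockedClocks(h, 210, 1500)  # min/max core clock in MHz
    finally:
        pynvml.nvmlShutdown()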
I had a 3x3090 rig; 2 of them are now fried. 2 of them had two-sided Bykski water blocks (they would never fit next to each other at 2-slot spacing; with the water header block and 90-degree corner plumbing, it's a full 4-slot footprint minimum). 3090s are notorious for their VRAM chips, on the backside, failing due to insufficient cooling.
Also, 1kW of heat needs to be evacuated. A few hours at full power can easily heat up the neighbouring rooms as well.
The M2 Ultra [0] seems to max out at 295W.
Dramatically oversimplifying, of course; there will be niches where one is the right choice over the other. In a continuous serving context you'd mostly want to run only models that fully fit in the VRAM of a single 3090, otherwise the crosstalk penalty applies. 24GB of VRAM is enough to run CodeLlama 34B q3_k_m GGUF with 10,000 tokens of context, though.
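For what it's worth, loading it that way looks something like this with llama-cpp-python (the model filename is a placeholder; n_gpu_layers=-1 offloads every layer to the single 3090):

    from llama_cpp import Llama

    llm = Llama(
        model_path="codellama-34b.Q3_K_M.gguf",  # placeholder filename
        n_ctx=10000,                             # the 10k-token context mentioned above
        n_gpu_layers=-1,                         # offload all layers to the one card
    )
    out = llm("### Write a binary search in Python\n", max_tokens=256)
    print(out["choices"][0]["text"])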