It seems a Thunderbolt/USB4 external NVMe enclosure can do about 2500-3000 MB/s, which is roughly half the speed of an internal SSD. So not at all bad. It'll just add an additional few tens of seconds while loading the model. Totally manageable.
Edit: in fact this is the proper route anyway, since it lets you work with huge model files and the intermediate FP16/FP32 files you generate while quantizing. Internal storage, no matter how much you have, runs out quickly.
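Back-of-the-envelope on the loading cost (the file sizes and throughput figures below are ballpark assumptions, not measurements):

    # Rough load-time estimate: quantized model file size divided by sustained
    # read speed. Sizes and speeds are ballpark assumptions, not measurements.
    def load_seconds(file_gb: float, mb_per_s: float) -> float:
        return file_gb * 1024 / mb_per_s

    for name, size_gb in [("34B q5", 24), ("70B q5", 48), ("70B fp16", 140)]:
        internal = load_seconds(size_gb, 6000)   # decent internal PCIe 4.0 NVMe
        external = load_seconds(size_gb, 2750)   # Thunderbolt/USB4 enclosure
        print(f"{name}: ~{internal:.0f}s internal vs ~{external:.0f}s external")

Even the 140 GB fp16 case only adds about half a minute over internal, which matches the "few tens of seconds" estimate.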
This only applies to Macs and Mac-a-likes. Actual desktop PCs have plenty of SATA ports and can store reasonable amounts of data without the crutch of external, higher-latency storage making things iffy. I say this as someone with TBs of Llama models on disk who does their own quantization (sometimes).
BTW my computer cost <$900 with 17TB of storage currently and can run up to a 34B 5-bit LLM. I could spend $250 more to upgrade to 128GB of DDR4-2666 RAM and run the 65B/70B models, but 180B is out of range. You do have to spend big money for that.
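Rough weights-only math behind those fit claims (the ~5.5 bits/weight figure is an assumed effective rate for a 5-bit k-quant; real files carry some overhead and inference needs extra room for the KV cache):

    # Weights-only memory estimate: parameters * bits-per-weight / 8.
    # ~5.5 bits/weight is an assumed effective rate for a 5-bit k-quant;
    # it ignores file overhead and the KV cache.
    def weights_gb(params_billion: float, bits_per_weight: float = 5.5) -> float:
        return params_billion * bits_per_weight / 8

    for params in (34, 70, 180):
        print(f"{params}B @ ~5.5 bpw: ~{weights_gb(params):.0f} GB")

So roughly 23 GB for the 34B, 48 GB for the 70B (hence the 128GB RAM upgrade), and about 124 GB for a 180B before any context, which is why that tier needs the big spend.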
Or are you comparing with CPU inference? In which case it's apples to oranges.
How much do GPUs with 192GB of RAM cost?
Edit: also, I think (unverified) very few systems have multiple PCIe 3/4 NVMe slots. There are companies selling PCIe cards that can take several NVMe drives, but that card alone, without the NVMe drives, will cost more than your $900 system.
But I'm also in the quad-3090 build idea stage as well, and I bought 2 with the intention of going up to 4 eventually for this purpose. However, since I bought my first 2 a few months back (at about 800 euro each!), the eBay prices have actually gone up... a lot. I purchased a specific model that I thought would be plentiful, as I had found a seller with a lot of them from OEM pulls who was taking good offers, and suddenly they all got sold. I feel like we are entering another GPU gap like 2020-2021.
Based on the performance of Llama 2 70B, I think the 96GB of VRAM and the CUDA core count x bandwidth of 4 3090s will hit a golden zone for the price-performance of a deep learning rig that can do a bit of finetuning on top of just inference (rough per-card numbers are sketched below).
Unless A6000 prices (or A100 prices) start plummeting.
My only holdout is the thought that maybe Nvidia releases a 48GB Titan-type card at a less-than-A6000 price sometime soon, which would shake things up.
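The per-card budget I have in mind, with assumed ballpark figures (e.g. ~4.5 bits/weight for a q4-class quant of a 70B model), looks roughly like this:

    # How a ~70B model sits in 4 x 24 GB of VRAM. Bits-per-weight and overheads
    # are assumed ballpark figures, not measurements.
    total_vram_gb = 4 * 24
    weights_q4_gb = 70 * 4.5 / 8        # ~39 GB for a q4-class quant
    weights_fp16_gb = 70 * 16 / 8       # 140 GB -- why fp16 inference is out

    headroom = total_vram_gb - weights_q4_gb
    print(f"q4 weights: ~{weights_q4_gb:.0f} GB of {total_vram_gb} GB, "
          f"leaving ~{headroom / 4:.0f} GB/card for KV cache or LoRA-style finetuning")
    print(f"fp16 weights alone: {weights_fp16_gb:.0f} GB (doesn't fit)")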
You generally can't hook up large storage drives over NVMe. Those are all tiny flash storage. I'm not sure why you brought it up.
1) You need 2 power supplies to run at peak power, and most US outlets can't handle it
2) If you're running on 1 power supply, you need to limit the power consumption and clock speed of the cards to prevent the PSU fuse from popping
3) To get all 4 cards running at once, an adjustment is needed in most motherboard BIOSes
4) Finding a motherboard that can handle enough PCIe lanes can be a challenge
Ultimately I think I get 95% or higher performance on one PSU with appropriate power limiting and 48 lanes (vs a full 64).
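For points 1 and 2, a rough wall-power budget (nameplate TDPs and an assumed PSU efficiency, not measurements; 3090 transients can briefly spike well above TDP):

    # Rough wall-power budget for a quad-3090 box. TDP, CPU, and efficiency
    # numbers are assumptions, and 3090 transients can spike well above TDP.
    gpus, gpu_tdp_w, gpu_capped_w = 4, 350, 280
    cpu_w, rest_w = 280, 100                  # HEDT CPU + board/fans/drives
    psu_efficiency = 0.92

    for label, per_gpu_w in [("stock", gpu_tdp_w), ("power-limited", gpu_capped_w)]:
        dc_w = gpus * per_gpu_w + cpu_w + rest_w
        wall_w = dc_w / psu_efficiency
        print(f"{label}: ~{dc_w} W DC, ~{wall_w:.0f} W at the wall "
              f"(~{wall_w / 120:.1f} A on a 120 V circuit)")

Stock lands around 16 A at the wall, more than a US 15 A circuit should carry continuously (~12 A under the 80% rule); even power-limited it stays marginal, which is why a second supply on a second circuit comes up.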
I went with the cheapest Threadripper I could find, but after testing, the performance hit for going from x16 to x8 PCIe lanes was actually not that large, and I would be okay going with 8 lanes for each card.
What’s your definition of large?
2TB and 4TB NVMe drives are not tiny. You can even buy 8TB NVMe drives, though those are more expensive and IMHO not worth it for this use case.
2TB NVMe drives are $60-$100 right now.
You can attach several of those via Thunderbolt/USB4 enclosures providing 2500-3000 MB/s.
1) I live in the EU with 240V mains and can pull 3600W from a single circuit and be OK. I'd also likely just limit the cards a bit for power efficiency, as I already do that with a single 3090; running at 300W vs the 350W max TDP makes such a small difference in performance that I don't notice it. This isn't gaming.
2) I still may run a dual PSU, but with the cards at 1200W and the CPU below that, I'd max out at around 1600W; add overhead, and there are some 2000W PSUs that could do it. Being on 240V mains also makes PSU selection easier, as I have seen that many can source more wattage on 230/240V than on 115/120V.
3) I'm leaning towards a Supermicro H12SSL-i motherboard, which I've seen people run quad A5000s on without any hassle. Proper PCIe x16 slots spaced 2 slots apart, and I'd likely be watercooling all 4 cards together at a 2-slot spacing or installing MSI Turbo heatsinks with blower fans, server-rack style, for 2-slot cooling.
4) See 3. An AMD EPYC 7713 is currently the choice, with 3200MHz DDR4 giving the memory bandwidth I need for a specific CPU workload as well.
I'm just currently trying to figure out whether it's worth dealing with the massive import duty I'd get hit with if I bought the H12SSL-i, RAM, and the EPYC from one of the many Chinese eBay resellers that sell these things at basically wholesale prices. I get hit with 20% VAT plus something like another 20% duty on imported computer parts. It's close to the point where it would be cheaper for me to book a round-trip flight to Taiwan and bring the parts home in my luggage. People in the USA don't realize how little they pay for computing equipment, relatively speaking.
So, minus the potential tax implications this can all be done for about 8000 EUR.
I do computational fluid dynamics work and am working on GPU-assisted solvers, which is the reason for the 64-core EPYC: parts of that workflow will still need CPU threads to pump the numbers to the GPGPUs. It's also why I don't just run this stuff in the cloud, as so many people suggest when they see someone spending 8K on a rig versus $2/hr per A100 at Lambda Labs. My specific work needs to stay on-prem.
All told it is less than two A6000s at current prices, and it is a whole machine to boot.
One suggestion: also limit clock speeds/voltages. There are transients when the 3090 loads models that can exceed double its typical draw; four starting at once can draw substantially more than I expected.
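A minimal sketch of applying those caps with nvidia-smi (the 300 W and 210-1700 MHz values are example numbers, not tuned ones; -lgc needs a reasonably recent driver and root, and voltages themselves aren't adjustable through nvidia-smi):

    import subprocess

    # Cap power and lock core clocks on four 3090s to blunt load-time transients.
    # The 300 W and 210-1700 MHz figures are example values, not recommendations.
    for gpu in range(4):
        # -pl sets the board power limit in watts
        subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", "300"], check=True)
        # -lgc locks the GPU core clocks to a min,max range in MHz
        subprocess.run(["nvidia-smi", "-i", str(gpu), "-lgc", "210,1700"], check=True)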
I had a 3x3090 rig. 2 of them are fried. 2 of them had two-sided Bykski water blocks (they would never fit next to each other 2 slots apart; a water header block with 90-degree corner plumbing is a full 4-slot size at minimum). 3090s are notorious for the VRAM chips on their backside failing due to insufficient cooling.
Also, 1kW of heat needs to be evacuated. A few hours at full power can easily heat up the neighbouring rooms as well.
The M2 Ultra [0] seems to max out at 295W.
Dramatically oversimplifying, of course. There will be niches where one will be the right choice over the other. In a continuous serving context you'd mostly only want to run models that fully fit in the VRAM of a single 3090; otherwise the crosstalk penalty applies. 24GB of VRAM is enough to run CodeLlama 34B q3_k_m GGUF with 10,000 tokens of context, though.
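For reference, a minimal llama-cpp-python sketch of that single-card setup (the model path is illustrative, and exactly how much context fits depends on the build and KV-cache settings):

    from llama_cpp import Llama

    # Single-3090 serving: fully offload a q3_k_m GGUF so the weights plus the
    # ~10k-token KV cache stay inside 24 GB of VRAM. The path is illustrative.
    llm = Llama(
        model_path="codellama-34b-instruct.Q3_K_M.gguf",
        n_ctx=10000,       # ~10k tokens of context
        n_gpu_layers=-1,   # offload all layers (in recent llama-cpp-python builds)
    )

    out = llm(
        "### Instruction: write a binary search in Python\n### Response:",
        max_tokens=256,
    )
    print(out["choices"][0]["text"])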