
255 points tbruckner | 1 comment
YetAnotherNick ◴[] No.37420966[source]
You just need 4 3090s (~$4,000) to run it. And 4 3090s are definitely a lot more powerful and versatile than an M2 Mac.
replies(3): >>37421025 #>>37421444 #>>37424271 #
mk_stjames ◴[] No.37421444[source]
The data buffer size shown by Georgi here is 96GB, plus there is other overhead; it states the recommended max working set size for this context is 147GB, so no, Falcon 180B in Q4 as shown wouldn't fit on 4x 24GB 3090s (96GB VRAM).
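
A quick back-of-envelope in Python to see why; the ~4.5 bits per weight figure is my own assumption for a typical Q4 scheme once quantization scales are included, not a number from Georgi's post:

    # rough size of Falcon 180B quantized to ~Q4
    params = 180e9             # parameter count
    bits_per_weight = 4.5      # assumed effective rate incl. quant scales
    weights_gb = params * bits_per_weight / 8 / 1e9
    print(f"weights alone: ~{weights_gb:.0f} GB")   # ~101 GB

    vram_gb = 4 * 24           # four 24GB 3090s
    # ~101 GB of weights vs 96 GB of VRAM, before KV cache and other
    # overhead, and the recommended working set quoted above is 147 GB
    print(weights_gb <= vram_gb)                    # False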

But I'm in the quad-3090 build idea stage as well, and I bought 2 with the intention of going up to 4 eventually for this purpose. However, since I bought my first 2 a few months back (at about 800 euro each!), the eBay prices have actually gone up... a lot. I had purchased a specific model that I thought would be plentiful, since I had found a seller with a lot of them from OEM pulls who was taking good offers, and suddenly they all got sold. I feel like we are entering another GPU gap like 2020-2021.

Based on the performance of Llama2 70B, I think 96GB of VRAM and the CUDA core count x bandwidth of 4 3090s will hit a golden zone for the price-performance of a deep learning rig that can do a bit of finetuning on top of just inference.

Unless A6000 prices (or A100 prices) start plummeting.

My only holdout is the thought that maybe Nvidia releases a 48GB Titan-type card at a less-than-A6000 price sometime soon, which would shake things up.

replies(2): >>37421913 #>>37452138 #
iaw ◴[] No.37421913[source]
I built a 4x 3090 rig a little while ago. There are a few hurdles:

1) You need 2 power supplies to run at peak power, and most US outlets can't handle that draw

2) If you're running on 1 power supply, you need to limit the power consumption and clock speed of the cards to prevent the PSU fuse from popping (sketched below, after this list)

3) To get all 4 cards running at once, an adjustment is needed in most motherboard BIOSes

4) Finding a motherboard that can handle enough PCIe lanes can be a challenge

Ultimately I think I get 95% or higher performance on one PSU with appropriate power limiting and 48 lanes (vs a full 64)
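
For reference, roughly the kind of limiting I mean in point 2, as a minimal Python sketch assuming a Linux box with the NVIDIA driver; the 280W cap and the clock window are example values, not a recommendation:

    import subprocess

    POWER_CAP_W = 280       # example cap, below the 350W stock limit
    MAX_CLOCK_MHZ = 1500    # example graphics clock ceiling

    for gpu in range(4):    # four 3090s
        # cap sustained board power (needs root)
        subprocess.run(["nvidia-smi", "-i", str(gpu),
                        "-pl", str(POWER_CAP_W)], check=True)
        # lock graphics clocks into a min,max window to tame spikes
        subprocess.run(["nvidia-smi", "-i", str(gpu),
                        "-lgc", f"300,{MAX_CLOCK_MHZ}"], check=True)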

replies(1): >>37422385 #
mk_stjames ◴[] No.37422385[source]
Here are my current solutions, mostly:

1) I live in the EU with 240V mains and can pull 3600W from a single circuit and be OK. I'd also likely just limit the cards a bit for power efficiency, as I already do that with a single 3090; running at 300W vs the 350W max TDP makes such a small difference in performance that I don't notice it. This isn't gaming.

2) I still may run a dual PSU, but with the CPU below and the cards at 1200W I'd max out at around 1600W; add overhead, and there are some 2000W PSUs that could do it (rough numbers sketched after this list). Also, being on 240V mains makes PSU selection easier, as I have seen many can source more wattage on 230/240V than on 115/120V.

3) I'm leaning towards a Supermicro H12SSL-i motherboard, which I've seen people run quad A5000s on without any hassle. Proper PCIe x16 slots spaced 2 slots apart, and I'd likely be watercooling all 4 cards together in a 2-slot spacing or installing MSI Turbo heatsinks with blower fans, server-rack style, for 2-slot cooling.

4) See 3. The AMD EPYC 7713 is currently the choice here, with 3200MHz DDR4 giving the memory bandwidth I need for a specific CPU workload as well.
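
The budget from point 2 as a quick Python scratchpad; the CPU and "everything else" numbers are my estimates, only the 4x 300W GPU figure is firm:

    gpu_w  = 4 * 300     # four 3090s power-limited to 300W each
    cpu_w  = 225         # EPYC 7713 nominal TDP
    rest_w = 150         # board, RAM, fans, pump, drives (estimate)
    load_w = gpu_w + cpu_w + rest_w
    print(f"~{load_w} W sustained")                 # ~1575 W
    print(f"{load_w / 2000:.0%} of a 2000 W PSU")   # ~79%, tight but doable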

I'm currently just trying to figure out whether it's worth dealing with the massive import duty I'd get hit with if I bought the H12SSL-i, RAM, and the EPYC from one of the many Chinese eBay resellers that sell these things at basically wholesale. I get hit with 20% VAT plus something like another 20% duty on imported computer parts. It's close to the point where it would be cheaper for me to book a round-trip flight to Taiwan and take the parts home in my luggage. People in the USA don't realize how little they are paying for computing equipment, relatively.

So, minus the potential tax implications, this can all be done for about 8000 EUR.
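
To put numbers on the tax hit, assuming duty is charged first and VAT on the duty-inclusive price (the usual EU ordering; either way it works out to roughly a 1.44x multiplier):

    base = 8000.0                  # the build, pre-tax, in EUR
    duty = base * 0.20             # ~20% import duty
    vat  = (base + duty) * 0.20    # 20% VAT on the duty-inclusive price
    print(f"landed cost: ~{base + duty + vat:.0f} EUR")   # ~11520 EUR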

I do computational fluid dynamics work and am working on GPU-assisted solvers, which is the reason for the 64-core EPYC: parts of that workflow will still need CPU threads to pump the numbers to the GPGPUs. This is also why I don't just run stuff in the cloud, as so many people suggest when talking about spending 8K on a rig vs $2/hr per A100 at Lambda Labs. My specific work needs to stay on-prem.

All told, it is less than two A6000s at current prices, and it's a whole machine to boot.

replies(2): >>37422589 #>>37424346 #
eurekin ◴[] No.37424346[source]
> 3) I'm leaning towards a Supermicro H12SSL-i motherboard, which I've seen people run quad A5000s on without any hassle. Proper PCIe x16 slots spaced 2 slots apart, and I'd likely be watercooling all 4 cards together in a 2-slot spacing or installing MSI Turbo heatsinks with blower fans, server-rack style, for 2-slot cooling.

I had a 3x 3090 rig. Two of them are fried. Two of them had two-sided Bykski blocks (they would never fit next to each other 2 slots apart; the water header block with its 90-degree corner plumbing takes a full 4 slots minimum). 3090s are notorious for the VRAM chips on their backside failing due to insufficient cooling.

Also, 1kW of heat needs to be evacuated. A few hours of full power can easily heat up the neighbouring rooms as well.