255 points tbruckner | 2 comments
YetAnotherNick ◴[] No.37420966[source]
You just need four 3090s ($4000) to run it. And four 3090s are definitely a lot more powerful and versatile than an M2 Mac.
replies(3): >>37421025 #>>37421444 #>>37424271 #
yumraj ◴[] No.37421025[source]
How much would that system cost, if you could easily buy those GPUs?
replies(2): >>37421217 #>>37421964 #
PartiallyTyped ◴[] No.37421217[source]
PCIe lanes will probably be an issue, so you're looking at a Threadripper Pro or an Epyc CPU; add at least half a grand for the motherboard and it's starting to look grim.
replies(1): >>37421346 #
thfuran ◴[] No.37421346[source]
And that's before you even get your first power bill.
replies(2): >>37421385 #>>37422182 #
easygenes ◴[] No.37422182[source]
For LLM workloads, the performance loss from power-limiting a 3090 to 200 W is fairly low, and that's roughly where you hit peak perf/W.
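
If anyone wants to try it, here's a minimal sketch of setting that cap programmatically with pynvml (sudo nvidia-smi -pl 200 per card does the same thing from the shell; the 200 W figure is just the sweet spot claimed above):

  import pynvml  # nvidia-ml-py bindings; changing limits needs root and an NVIDIA driver

  pynvml.nvmlInit()
  for i in range(pynvml.nvmlDeviceGetCount()):
      handle = pynvml.nvmlDeviceGetHandleByIndex(i)
      # NVML takes the limit in milliwatts: 200_000 mW = 200 W
      pynvml.nvmlDeviceSetPowerManagementLimit(handle, 200_000)
  pynvml.nvmlShutdown()
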
replies(1): >>37426882 #
1. yumraj ◴[] No.37426882[source]
So even with power limiting, with four 3090s you're looking at 800 W from the GPUs alone, so about 1000 W for the whole system, give or take. Yes?

The M2 Ultra [0] seems to max out at 295 W.

[0] https://support.apple.com/en-us/HT213100
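
Napkin math for the power-bill angle; every input here is an assumption rather than a measurement (200 W per capped GPU, roughly 200 W for CPU/board/fans, $0.15/kWh, 8 hours of load a day):

  gpus, gpu_w, rest_w = 4, 200, 200
  total_w = gpus * gpu_w + rest_w            # ~1000 W at the wall while generating
  kwh_per_month = total_w / 1000 * 8 * 30    # ~240 kWh
  print(total_w, round(kwh_per_month * 0.15, 2))  # ~$36/month under these assumptions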

replies(1): >>37429539 #
2. easygenes ◴[] No.37429539[source]
Yeah, but watt for watt the 3090s will output more tokens, as a single 3090 has more memory bandwidth than an M2 Ultra. That's the main performance constraint for LLMs.
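
As a rough illustration of why bandwidth dominates: each generated token has to stream essentially the whole quantized model through memory once, so peak bandwidth divided by model size gives a ceiling on decode speed. The bandwidth numbers are published specs; the ~16 GB model size is an assumed q3_K_M file size, and real throughput lands below these ceilings:

  def max_tokens_per_sec(bandwidth_gb_s, model_gb):
      # Memory-bound decode: each new token reads roughly all of the weights once.
      return bandwidth_gb_s / model_gb

  model_gb = 16.0                            # assumed size of a CodeLlama 34B q3_K_M file
  print(max_tokens_per_sec(936, model_gb))   # RTX 3090 (936 GB/s): ~58 tok/s ceiling
  print(max_tokens_per_sec(800, model_gb))   # M2 Ultra (800 GB/s): ~50 tok/s ceiling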

Dramatically oversimplifying, of course; there will be niches where one is the right choice over the other. In a continuous-serving context you'd mostly want to run models that fit entirely in the VRAM of a single 3090, since splitting a model across cards adds a cross-GPU communication penalty. 24 GB of VRAM is enough to run CodeLlama 34B q3_K_M GGUF with 10,000 tokens of context, though.
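
A minimal sketch of that single-3090 setup using llama-cpp-python; the model filename is hypothetical, and n_ctx just mirrors the 10,000-token context mentioned above:

  from llama_cpp import Llama

  llm = Llama(
      model_path="codellama-34b.Q3_K_M.gguf",  # hypothetical local GGUF file
      n_ctx=10000,       # the 10k-token context window discussed above
      n_gpu_layers=-1,   # offload every layer so the model stays on the single 24 GB card
  )
  out = llm("Write a binary search in Python.", max_tokens=256)
  print(out["choices"][0]["text"])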