255 points tbruckner | 2 comments
YetAnotherNick ◴[] No.37420966[source]
You just need four 3090s ($4000) to run it. And four 3090s are definitely a lot more powerful and versatile than an M2 Mac.
replies(3): >>37421025 #>>37421444 #>>37424271 #
yumraj ◴[] No.37421025[source]
How much would that system cost, if you could easily buy those GPUs?
replies(2): >>37421217 #>>37421964 #
PartiallyTyped ◴[] No.37421217[source]
PCIe lanes will probably be an issue, so you're looking at a Threadripper Pro or an Epyc CPU; add at least half a grand for the motherboard and it's starting to look grim.
replies(1): >>37421346 #
thfuran ◴[] No.37421346[source]
And that's before you even get your first power bill.
replies(2): >>37421385 #>>37422182 #
easygenes ◴[] No.37422182[source]
For LLM workloads, the performance loss from power-limiting a 3090 to 200 W is fairly low, and that's roughly where you hit peak perf/W.
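
If anyone wants to try it, here's a minimal sketch of setting that cap programmatically with pynvml (sudo nvidia-smi -pl 200 per card does the same thing from the shell; the 200 W figure is just the sweet spot claimed above):

  import pynvml  # nvidia-ml-py bindings; changing limits needs root and an NVIDIA driver

  pynvml.nvmlInit()
  for i in range(pynvml.nvmlDeviceGetCount()):
      handle = pynvml.nvmlDeviceGetHandleByIndex(i)
      # NVML takes the limit in milliwatts: 200_000 mW = 200 W
      pynvml.nvmlDeviceSetPowerManagementLimit(handle, 200_000)
  pynvml.nvmlShutdown()
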
replies(1): >>37426882 #
1. yumraj ◴[] No.37426882[source]
So even with power limiting, with four 3090s you're looking at 800 W from the GPUs alone, so about 1000 W for the whole system, give or take. Yes?

The M2 Ultra [0] seems to max out at 295 W.

[0] https://support.apple.com/en-us/HT213100
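
Napkin math for the power-bill angle; every input here is an assumption rather than a measurement (200 W per capped GPU, roughly 200 W for CPU/board/fans, $0.15/kWh, 8 hours of load a day):

  gpus, gpu_w, rest_w = 4, 200, 200
  total_w = gpus * gpu_w + rest_w            # ~1000 W at the wall while generating
  kwh_per_month = total_w / 1000 * 8 * 30    # ~240 kWh
  print(total_w, round(kwh_per_month * 0.15, 2))  # ~$36/month under these assumptions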

replies(1): >>37429539 #
2. easygenes ◴[] No.37429539[source]
Yeah, but watt for watt the 3090s will output more tokens, as a single 3090 has more memory bandwidth than an M2 Ultra. That's the main performance constraint for LLMs.
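
As a rough illustration of why bandwidth dominates: each generated token has to stream essentially the whole quantized model through memory once, so peak bandwidth divided by model size gives a ceiling on decode speed. The bandwidth numbers are published specs; the ~16 GB model size is an assumed q3_K_M file size, and real throughput lands below these ceilings:

  def max_tokens_per_sec(bandwidth_gb_s, model_gb):
      # Memory-bound decode: each new token reads roughly all of the weights once.
      return bandwidth_gb_s / model_gb

  model_gb = 16.0                            # assumed size of a CodeLlama 34B q3_K_M file
  print(max_tokens_per_sec(936, model_gb))   # RTX 3090 (936 GB/s): ~58 tok/s ceiling
  print(max_tokens_per_sec(800, model_gb))   # M2 Ultra (800 GB/s): ~50 tok/s ceiling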

Dramatically oversimplifying, of course; there will be niches where one is the right choice over the other. In a continuous-serving context you'd mostly want to run models that fit entirely in the VRAM of a single 3090, since splitting a model across cards adds a cross-GPU communication penalty. 24 GB of VRAM is enough to run CodeLlama 34B q3_K_M GGUF with 10,000 tokens of context, though.
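
A minimal sketch of that single-3090 setup using llama-cpp-python; the model filename is hypothetical, and n_ctx just mirrors the 10,000-token context mentioned above:

  from llama_cpp import Llama

  llm = Llama(
      model_path="codellama-34b.Q3_K_M.gguf",  # hypothetical local GGUF file
      n_ctx=10000,       # the 10k-token context window discussed above
      n_gpu_layers=-1,   # offload every layer so the model stays on the single 24 GB card
  )
  out = llm("Write a binary search in Python.", max_tokens=256)
  print(out["choices"][0]["text"])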