I'm also in the quad-3090 build idea stage and bought 2 with the intention of going up to 4 eventually for this purpose. However, since I bought my first 2 a few months back (at about 800 euro each!), the eBay prices have actually gone up... a lot. I purchased a specific model that I thought would be plentiful, since I had found a seller with a lot of them from OEM pulls who was taking good offers, and suddenly they all got sold. I feel like we are entering another GPU gap like 2020-2021.
Based on the performance of Llama 2 70B, I think the 96GB of VRAM plus the CUDA core count and bandwidth of four 3090s will hit a golden zone for price-performance in a deep learning rig that can do a bit of finetuning on top of just inference.
Unless A6000 prices (or A100 prices) start plummeting.
My only holdout is the thought that maybe Nvidia releases a 48GB Titan-type card at a less-than-A6000 price sometime soon, which would shake things up.
1) You need 2 power supplies to run at peak power, and most US outlets can't handle that
2) If you're running on 1 power supply, you need to limit the power consumption and clock speeds of the cards to prevent the PSU fuse from popping
3) To get all 4 cards running at once, most motherboard BIOSes need an adjustment
4) Finding a motherboard that can handle enough PCIe lanes can be a challenge
Ultimately I think I get 95% or higher performance on one PSU with appropriate power limiting and 48 lanes (vs a full 64)
I went with the cheapest Threadripper I could find, but after testing, the performance hit for going from 16 to 8 PCIe lanes per card was actually not that large, and I would be okay with 8 lanes for each card.
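On the power limiting: a minimal sketch of how I'd cap all the cards at once, assuming the nvidia-ml-py (pynvml) package. It's the same effect as nvidia-smi -pl 300, needs root, and 300W is just the example figure, not a recommendation:

    import pynvml

    TARGET_W = 300  # example cap; pick whatever your PSU budget works out to
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)  # mW
            limit = min(max(TARGET_W * 1000, lo), hi)  # clamp to what the card allows
            pynvml.nvmlDeviceSetPowerManagementLimit(h, limit)
            print(f"GPU {i}: power limit set to {limit / 1000:.0f} W")
    finally:
        pynvml.nvmlShutdown()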
1) I live in the EU with 240V mains and can pull 3600W from a single circuit and be OK. I'd also likely limit the cards a bit for power efficiency, as I already do that with a single 3090; running at 300W vs the 350W max TDP makes such a small difference in performance that I don't notice it. This isn't gaming.
2) I still may run dual PSUs, but with the CPU (below) and the cards at 1200W I'd max out at around 1600W; add overhead, and there are some 2000W PSUs that could do it (rough arithmetic in the sketch after this list). Also, being on 240V mains makes PSU selection easier, as I've seen many units that can source more wattage on 230/240V than on 115/120V.
3) I'm leaning towards a Supermicro H12SSL-i motherboard, which I've seen people run quad A5000s on without any hassle. It has proper PCIe x16 slots spaced 2 slots apart, and I'd likely be watercooling all 4 cards together at 2-slot spacing or installing MSI Turbo heatsinks with blower fans, server-rack style, for 2-slot cooling.
4) See 3. An AMD EPYC 7713 is currently the choice, with 3200MHz DDR4 giving the memory bandwidth I need for a specific CPU workload as well.
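Rough single-PSU arithmetic for point 2, using the figures quoted above (the CPU and overhead numbers are guesses off the 7713's 225W TDP, not measurements):

    cards_w = 4 * 300    # four 3090s power-limited to ~300 W each
    cpu_w   = 280        # EPYC 7713: 225 W TDP plus platform headroom (assumed)
    rest_w  = 100        # fans, pumps, drives, RAM (assumed)
    load_w  = cards_w + cpu_w + rest_w   # ~1580 W, i.e. "around 1600 W"
    psu_w   = load_w / 0.9               # keep the PSU below ~90% of its rating
    print(load_w, round(psu_w))          # 1580, 1756 -> a 2000 W unit has margin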
I'm currently trying to figure out whether it's worth dealing with the massive import duty I'd get hit with if I bought the H12SSL-i, RAM, and the EPYC from one of the many Chinese eBay resellers that sell these things at basically wholesale. I get hit with 20% VAT plus something like another 20% duty on imported computer parts. It's close to the point where it would be cheaper for me to book a round-trip flight to Taiwan and take the parts home in my luggage. People in the USA don't realize how relatively little they pay for computing equipment.
So, minus the potential tax implications this can all be done for about 8000 EUR.
I do computational fluid dynamics work and am working on GPU-assisted solvers, which is the reason for the 64-core EPYC: parts of that workflow still need CPU threads to pump numbers to the GPGPUs. This is also why I don't just run things in the cloud, as so many people suggest when talking about spending 8K on a rig vs $2/hr per A100 at Lambda Labs; my specific work needs to stay on-prem.
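As a toy illustration of that CPU-feeds-GPU pattern (not my actual solver; it assumes CuPy, assemble_chunk is a made-up stand-in for the CPU-side assembly work, and true copy/compute overlap would also need pinned host buffers):

    import numpy as np
    import cupy as cp

    streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]

    def assemble_chunk(i):
        # stand-in for the CPU-heavy assembly the EPYC cores would be doing
        return (np.random.rand(1_000_000) * i).astype(np.float32)

    partials = []
    for i in range(8):
        host = assemble_chunk(i)        # CPU threads produce the next block
        with streams[i % 2]:            # alternate streams so copies/kernels can queue
            dev = cp.asarray(host)      # host-to-device copy on this stream
            partials.append(dev.sum())  # GPU work enqueued behind the copy
    for s in streams:
        s.synchronize()
    total = float(sum(p.get() for p in partials))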
All told, it's less than two A6000s at current prices, and it's a whole machine to boot.
One suggestion: also limit clock speeds/voltages. There are transients when a 3090 loads models that can exceed double its typical draw, and 4 starting at once can draw substantially more than I expected.
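If it helps, a sketch of locking core clocks via NVML (the call behind nvidia-smi -lgc; it assumes pynvml, needs root, and the 210-1500 MHz range is just an illustrative pick):

    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            pynvml.nvmlDeviceSetGpuLockedClocks(h, 210, 1500)  # min/max core clock in MHz
    finally:
        pynvml.nvmlShutdown()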
I had a 3x3090 rig; 2 of them are now fried. 2 of them had two-sided Bykski water blocks (they would never fit next to each other at 2-slot spacing; with the water header block and 90-degree corner plumbing, it's a full 4-slot footprint minimum). 3090s are notorious for their VRAM chips, on the backside, failing due to insufficient cooling.
Also, 1kW of heat needs to be evacuated. A few hours at full power can easily heat up the neighbouring rooms as well.
The M2 Ultra [0] seems to max out at 295W.
Dramatically oversimplifying, of course; there will be niches where one is the right choice over the other. In a continuous serving context you'd mostly want to run only models that fully fit in the VRAM of a single 3090, otherwise the crosstalk penalty applies. 24GB of VRAM is enough to run CodeLlama 34B q3_k_m GGUF with 10,000 tokens of context, though.
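For what it's worth, loading it that way looks something like this with llama-cpp-python (the model filename is a placeholder; n_gpu_layers=-1 offloads every layer to the single 3090):

    from llama_cpp import Llama

    llm = Llama(
        model_path="codellama-34b.Q3_K_M.gguf",  # placeholder filename
        n_ctx=10000,                             # the 10k-token context mentioned above
        n_gpu_layers=-1,                         # offload all layers to the one card
    )
    out = llm("### Write a binary search in Python\n", max_tokens=256)
    print(out["choices"][0]["text"])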