It seems a Thunderbolt/USB4 external NVMe enclosure can do about 2500-3000 MB/s, which is roughly half the speed of an internal SSD. So not at all bad. It'll just add an additional few tens of seconds while loading the model. Totally manageable.
Edit: in fact this is the proper route anyway, since it lets you work with huge model files and the intermediate FP16/FP32 files you generate while quantizing. Internal storage, no matter how much you have, runs out quickly.
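Back-of-the-envelope on the loading cost (the file sizes and throughput figures below are ballpark assumptions, not measurements):

    # Rough load-time estimate: quantized model file size divided by sustained
    # read speed. Sizes and speeds are ballpark assumptions, not measurements.
    def load_seconds(file_gb: float, mb_per_s: float) -> float:
        return file_gb * 1024 / mb_per_s

    for name, size_gb in [("34B q5", 24), ("70B q5", 48), ("70B fp16", 140)]:
        internal = load_seconds(size_gb, 6000)   # decent internal PCIe 4.0 NVMe
        external = load_seconds(size_gb, 2750)   # Thunderbolt/USB4 enclosure
        print(f"{name}: ~{internal:.0f}s internal vs ~{external:.0f}s external")

Even the 140 GB fp16 case only adds about half a minute over internal, which matches the "few tens of seconds" estimate.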
This only applies to Macs and Mac-a-likes. Actual desktop PCs have plenty of SATA ports and can store reasonable amounts of data without the crutch of external, higher-latency storage making things iffy. I say this as someone with TBs of Llama models on disk who does their own quantization (sometimes).
BTW my computer cost <$900 with 17TB of storage currently and can run up to a 34B 5-bit LLM. I could spend $250 more to upgrade to 128GB of DDR4-2666 RAM and run the 65B/70B models, but 180B is out of range. You do have to spend big money for that.
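Rough weights-only math behind those fit claims (the ~5.5 bits/weight figure is an assumed effective rate for a 5-bit k-quant; real files carry some overhead and inference needs extra room for the KV cache):

    # Weights-only memory estimate: parameters * bits-per-weight / 8.
    # ~5.5 bits/weight is an assumed effective rate for a 5-bit k-quant;
    # it ignores file overhead and the KV cache.
    def weights_gb(params_billion: float, bits_per_weight: float = 5.5) -> float:
        return params_billion * bits_per_weight / 8

    for params in (34, 70, 180):
        print(f"{params}B @ ~5.5 bpw: ~{weights_gb(params):.0f} GB")

So roughly 23 GB for the 34B, 48 GB for the 70B (hence the 128GB RAM upgrade), and about 124 GB for a 180B before any context, which is why that tier needs the big spend.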
Or are you comparing with CPU inference? In which case it's apples to oranges.
How much do GPUs with 192GB of RAM cost?
Edit: also, I think (unverified) very few systems have multiple PCIe 3/4 NVMe slots. There are companies selling PCIe cards that can take several NVMe drives, but that card alone, without the NVMe drives, will cost more than your $900 system.
But I'm also in the quad-3090 build idea stage as well, and I bought 2 with the intention of going up to 4 eventually for this purpose. However, since I bought my first 2 a few months back (at about 800 euro each!), the eBay prices have actually gone up... a lot. I purchased a specific model that I thought would be plentiful, as I had found a seller with a lot of them from OEM pulls who was taking good offers, and suddenly they all got sold. I feel like we are entering another GPU gap like 2020-2021.
Based on the performance of Llama 2 70B, I think the 96GB of VRAM and the CUDA core count x bandwidth of 4 3090s will hit a golden zone for the price-performance of a deep learning rig that can do a bit of finetuning on top of just inference (rough per-card numbers are sketched below).
Unless A6000 prices (or A100 prices) start plummeting.
My only holdout is the thought that maybe Nvidia releases a 48GB Titan-type card at a less-than-A6000 price sometime soon, which would shake things up.
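The per-card budget I have in mind, with assumed ballpark figures (e.g. ~4.5 bits/weight for a q4-class quant of a 70B model), looks roughly like this:

    # How a ~70B model sits in 4 x 24 GB of VRAM. Bits-per-weight and overheads
    # are assumed ballpark figures, not measurements.
    total_vram_gb = 4 * 24
    weights_q4_gb = 70 * 4.5 / 8        # ~39 GB for a q4-class quant
    weights_fp16_gb = 70 * 16 / 8       # 140 GB -- why fp16 inference is out

    headroom = total_vram_gb - weights_q4_gb
    print(f"q4 weights: ~{weights_q4_gb:.0f} GB of {total_vram_gb} GB, "
          f"leaving ~{headroom / 4:.0f} GB/card for KV cache or LoRA-style finetuning")
    print(f"fp16 weights alone: {weights_fp16_gb:.0f} GB (doesn't fit)")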
You generally can't hook up large storage drives over NVMe. Those are all tiny flash storage. I'm not sure why you brought it up.
1) You need 2 power supplies to run at peak power, and most US outlets can't handle it
2) If you're running on 1 power supply, you need to limit the power consumption and clock speed of the cards to prevent the PSU fuse from popping
3) To get all 4 cards running at once, an adjustment is needed in most motherboard BIOSes
4) Finding a motherboard that can handle enough PCIe lanes can be a challenge
Ultimately I think I get 95% or higher performance on one PSU with appropriate power limiting and 48 lanes (vs a full 64).
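For points 1 and 2, a rough wall-power budget (nameplate TDPs and an assumed PSU efficiency, not measurements; 3090 transients can briefly spike well above TDP):

    # Rough wall-power budget for a quad-3090 box. TDP, CPU, and efficiency
    # numbers are assumptions, and 3090 transients can spike well above TDP.
    gpus, gpu_tdp_w, gpu_capped_w = 4, 350, 280
    cpu_w, rest_w = 280, 100                  # HEDT CPU + board/fans/drives
    psu_efficiency = 0.92

    for label, per_gpu_w in [("stock", gpu_tdp_w), ("power-limited", gpu_capped_w)]:
        dc_w = gpus * per_gpu_w + cpu_w + rest_w
        wall_w = dc_w / psu_efficiency
        print(f"{label}: ~{dc_w} W DC, ~{wall_w:.0f} W at the wall "
              f"(~{wall_w / 120:.1f} A on a 120 V circuit)")

Stock lands around 16 A at the wall, more than a US 15 A circuit should carry continuously (~12 A under the 80% rule); even power-limited it stays marginal, which is why a second supply on a second circuit comes up.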
I went with the cheapest Threadripper I could find, but after testing, the performance hit for going from x16 to x8 PCIe lanes was actually not that large, and I would be okay going with 8 lanes for each card.
What’s your definition of large?
2TB and 4TB NVMe drives are not tiny. You can even buy 8TB NVMe drives, though those are more expensive and IMHO not worth it for this use case.
2TB NVMe drives are $60-$100 right now.
You can attach several of those via Thunderbolt/USB4 enclosures providing 2500-3000 MB/s.
1) I live in the EU with 240V mains and can pull 3600W from a single circuit and be OK. I'd also likely just limit the cards a bit for power efficiency, as I already do that with a single 3090; running at 300W vs the 350W max TDP makes such a small difference in performance that I don't notice it. This isn't gaming.
2) I still may run a dual PSU, but with the cards at 1200W and the CPU below that, I'd max out at around 1600W; add overhead, and there are some 2000W PSUs that could do it. Being on 240V mains also makes PSU selection easier, as I have seen that many can source more wattage on 230/240V than on 115/120V.
3) I'm leaning towards a Supermicro H12SSL-i motherboard, which I've seen people run quad A5000s on without any hassle. Proper PCIe x16 slots spaced 2 slots apart, and I'd likely be watercooling all 4 cards together at a 2-slot spacing or installing MSI Turbo heatsinks with blower fans, server-rack style, for 2-slot cooling.
4) See 3. An AMD EPYC 7713 is currently the choice, with 3200MHz DDR4 giving the memory bandwidth I need for a specific CPU workload as well.
I'm just currently trying to figure out whether it's worth dealing with the massive import duty I'd get hit with if I bought the H12SSL-i, RAM, and the EPYC from one of the many Chinese eBay resellers that sell these things at basically wholesale prices. I get hit with 20% VAT plus something like another 20% duty on imported computer parts. It's close to the point where it would be cheaper for me to book a round-trip flight to Taiwan and bring the parts home in my luggage. People in the USA don't realize how little they pay for computing equipment, relatively speaking.
So, minus the potential tax implications this can all be done for about 8000 EUR.
I do computational fluid dynamics work and am working on GPU-assisted solvers, which is the reason for the 64-core EPYC: parts of that workflow will still need CPU threads to pump the numbers to the GPGPUs. It's also why I don't just run this stuff in the cloud, as so many people suggest when they see someone spending 8K on a rig versus $2/hr per A100 at Lambda Labs. My specific work needs to stay on-prem.
All told it is less than two A6000s at current prices, and it is a whole machine to boot.
One suggestion: also limit clock speeds/voltages. There are transients when the 3090 loads models that can exceed double its typical draw; four starting at once can draw substantially more than I expected.
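A minimal sketch of applying those caps with nvidia-smi (the 300 W and 210-1700 MHz values are example numbers, not tuned ones; -lgc needs a reasonably recent driver and root, and voltages themselves aren't adjustable through nvidia-smi):

    import subprocess

    # Cap power and lock core clocks on four 3090s to blunt load-time transients.
    # The 300 W and 210-1700 MHz figures are example values, not recommendations.
    for gpu in range(4):
        # -pl sets the board power limit in watts
        subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", "300"], check=True)
        # -lgc locks the GPU core clocks to a min,max range in MHz
        subprocess.run(["nvidia-smi", "-i", str(gpu), "-lgc", "210,1700"], check=True)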
I had a 3x3090 rig. 2 of them are fried. 2 of them had two-sided Bykski water blocks (they would never fit next to each other 2 slots apart; a water header block with 90-degree corner plumbing is a full 4-slot size at minimum). 3090s are notorious for the VRAM chips on their backside failing due to insufficient cooling.
Also, 1kW of heat needs to be evacuated. A few hours at full power can easily heat up the neighbouring rooms as well.
The M2 Ultra [0] seems to max out at 295W.
Dramatically oversimplifying, of course. There will be niches where one will be the right choice over the other. In a continuous serving context you'd mostly only want to run models that fully fit in the VRAM of a single 3090; otherwise the crosstalk penalty applies. 24GB of VRAM is enough to run CodeLlama 34B q3_k_m GGUF with 10,000 tokens of context, though.
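For reference, a minimal llama-cpp-python sketch of that single-card setup (the model path is illustrative, and exactly how much context fits depends on the build and KV-cache settings):

    from llama_cpp import Llama

    # Single-3090 serving: fully offload a q3_k_m GGUF so the weights plus the
    # ~10k-token KV cache stay inside 24 GB of VRAM. The path is illustrative.
    llm = Llama(
        model_path="codellama-34b-instruct.Q3_K_M.gguf",
        n_ctx=10000,       # ~10k tokens of context
        n_gpu_layers=-1,   # offload all layers (in recent llama-cpp-python builds)
    )

    out = llm(
        "### Instruction: write a binary search in Python\n### Response:",
        max_tokens=256,
    )
    print(out["choices"][0]["text"])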