577 points by simonw | 16 comments
1. joelthelion ◴[] No.44724227[source]
Apart from using a Mac, what can you use for inference with reasonable performance? Is a Mac the only realistic option at the moment?
replies(6): >>44724398 #>>44724419 #>>44724553 #>>44724563 #>>44724959 #>>44727049 #
2. AlexeyBrin ◴[] No.44724398[source]
A gaming PC with an NVIDIA 4090/5090 will be more than adequate for running local models.

Where a Mac may beat the above is on the memory side: if a model requires more than 24/32 GB of GPU memory, you are usually better off with a Mac with 64/128 GB of RAM. On a Mac the memory is shared between the CPU and GPU, so the GPU can load larger models.
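As a rough rule of thumb (ignoring KV cache and runtime overhead, so treat these as lower bounds), weight memory is parameter count times bytes per weight:

    # Rough estimate of weight memory at a given quantization.
    # Ignores KV cache and runtime overhead, so real usage is higher.
    def weight_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    for params, bits in [(8, 4), (32, 4), (70, 4), (70, 8)]:
        print(f"{params}B @ {bits}-bit ~= {weight_gb(params, bits):.0f} GB")
    # 8B  @ 4-bit ~=  4 GB -> fits easily on a 24 GB card
    # 32B @ 4-bit ~= 16 GB -> tight but doable in 24 GB
    # 70B @ 4-bit ~= 35 GB -> needs 48 GB+ of VRAM or a high-RAM Mac
    # 70B @ 8-bit ~= 70 GB -> 128 GB Mac territory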

3. reilly3000 ◴[] No.44724419[source]
The top 3 approaches I see a lot on r/localllama are:

1. 2-4x NVIDIA 3090s or better; some are getting Chinese 48GB cards. The VRAM ceiling keeps the very biggest models from loading, but most setups can run most quants at great speeds.

2. Epyc servers running CPU inference with lots of RAM and as much memory bandwidth as is available. With these setups people are getting something like 5-10 t/s, but are able to run 450B-parameter models (some rough bandwidth math is sketched after this list).

3. High-RAM Macs with as much memory bandwidth as possible. They are the most balanced approach and surprisingly reasonable relative to the other options.
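A back-of-the-envelope sketch of that bandwidth math: at batch size 1, decoding roughly reads every active weight once per token, so tokens/s is at most bandwidth divided by active weight bytes. The bandwidth figures below are ballpark, not measurements, and real numbers come in lower due to overhead:

    # Back-of-the-envelope: tokens/s ~= memory bandwidth / bytes of active weights.
    # Bandwidth figures are ballpark; real-world numbers are lower due to overhead.
    def tokens_per_sec(bandwidth_gb_s: float, active_params_b: float, bits: float) -> float:
        bytes_per_token = active_params_b * 1e9 * bits / 8
        return bandwidth_gb_s * 1e9 / bytes_per_token

    # A big MoE with ~37B active parameters at 4-bit behaves like a ~37B dense model here.
    for name, bw in [("Epyc CPU ~400 GB/s", 400),
                     ("RTX 3090 ~936 GB/s", 936),
                     ("M-series Ultra ~800 GB/s", 800)]:
        print(f"{name}: ~{tokens_per_sec(bw, 37, 4):.0f} t/s upper bound")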

4. thenaturalist ◴[] No.44724553[source]
This guy [0] does a ton of in-depth HW comparison/benchmarking, including against Mac mini clusters and an M3 Ultra.

0: https://www.youtube.com/@AZisk

5. regularfry ◴[] No.44724563[source]
This one should just about fit on a box with an RTX 4090 and 64GB of RAM (which is what I've got) at q4. I don't know what the performance will be yet. I'm hoping for an Unsloth dynamic quant to get the most out of it.
replies(1): >>44725469 #
6. whimsicalism ◴[] No.44724959[source]
You are almost certainly better off renting GPUs, but I understand self-hosting is an HN touchstone.
replies(2): >>44725021 #>>44725699 #
7. qingcharles ◴[] No.44725021[source]
This. Especially if you just want to try a bunch of different things out. Renting is insanely cheap -- to the point where I don't understand how the people renting the hardware out are making their money back unless they stole the hardware and power.

It can really help you figure a ton of things out before you blow the cash on your own hardware.

replies(1): >>44725157 #
8. 4b11b4 ◴[] No.44725157{3}[source]
Recommended sites to rent from?
replies(2): >>44725244 #>>44725337 #
9. doormatt ◴[] No.44725244{4}[source]
runpod.io
10. whimsicalism ◴[] No.44725337{4}[source]
Runpod, Vast, Hyperbolic, Prime Intellect. If all you're going to be doing is running LLMs, you can pay per token on OpenRouter or via some of the providers listed there.
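For the pay-per-token route, a minimal sketch against OpenRouter's OpenAI-compatible endpoint (the model id is purely illustrative):

    # Pay-per-token via OpenRouter's OpenAI-compatible API.
    # The model id below is illustrative; pick anything listed on openrouter.ai.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_API_KEY",
    )
    resp = client.chat.completions.create(
        model="qwen/qwen3-coder",  # illustrative
        messages=[{"role": "user", "content": "Hello from a rented token."}],
    )
    print(resp.choices[0].message.content)
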
11. weberer ◴[] No.44725469[source]
What's important is VRAM, not system RAM. The 4090 has 16GB of VRAM, so you'll be limited to smaller models at decent speeds. Of course, you can run models from system memory, but your tokens/second will be orders of magnitude slower. ARM Macs are the exception, since they have unified memory, allowing high bandwidth between the GPU and the system's RAM.
replies(2): >>44729356 #>>44731634 #
12. mrinterweb ◴[] No.44725699[source]
I don't know about that. I've had my RTX 4090 for nearly 3 years now. Suppose I had a script that provisioned and deprovisioned a rented 4090 at $0.70/hr for an 8-hour work day, 20 work days per month, assuming 2 paid weeks off per year plus normal holidays, over 3 years:

0.7 * 8 * ((20 * 12) - 8 - 14) * 3 = $3662

I bought my RTX 4090 for about $2200. I also had the pleasure of being able to use it for gaming when I wasn't working. To be fair, the VRAM requirements for local models keep climbing and my 4090 isn't able to run many of the latest LLMs. Also, I omitted the cost of electricity for the local LLM server; I haven't been measuring the total watts consumed by just that machine.
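The same assumed numbers as a tiny script (these are the assumptions above, not measured costs, and electricity is still excluded):

    # Break-even sketch using the assumed numbers above (not measured costs).
    RENTAL_RATE = 0.70                       # $/hr for a rented 4090
    HOURS_PER_DAY = 8
    WORKDAYS_PER_YEAR = 20 * 12 - 8 - 14     # minus holidays and time off
    PURCHASE_PRICE = 2200                    # what the card cost up front

    rental_per_year = RENTAL_RATE * HOURS_PER_DAY * WORKDAYS_PER_YEAR
    print(f"rental per year:  ${rental_per_year:,.0f}")                       # ~$1,221
    print(f"rental over 3 yr: ${rental_per_year * 3:,.0f}")                   # ~$3,662
    print(f"break-even at:    {PURCHASE_PRICE / rental_per_year:.1f} years")  # ~1.8 years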

One nice thing about renting is that it gives you flexibility in terms of what you want to try.

If you're really looking for the best deals, look at third-party hosts serving open models with API-based per-token pricing; or honestly, a Claude subscription can easily be worth it if you use LLMs a fair bit.

replies(1): >>44725791 #
13. whimsicalism ◴[] No.44725791{3}[source]
1. I agree - there are absolutely scenarios in which it can make sense to buy a GPU and run it yourself. If you are managing a software firm with multiple employees, you very well might break even in less than a few years. But I would wager this is not the case for 90%+ of people self-hosting these models, unless they have some other good reason (like gaming) to buy a GPU.

2. I basically agree with your caveats - excluding electricity is a pretty big exclusion, and I don't think you've had 3 years of really high-value self-hostable models; I would really only say the last year, and I'm somewhat skeptical of how good the ones that fit in 24GB of VRAM can be. 4x 4090 is a different story.

14. badsectoracula ◴[] No.44727049[source]
An Nvidia GPU is the most common answer, but personally i've done all my LLM use locally using mainly Mistral Small 3.1/3.2-based models and llama.cpp with an AMD RX 7900 XTX GPU. It only gives you ~4.71 tokens per second, but that is fast enough for a lot of uses. For example, last month or so i wrote a raytracer[0][1] in C with Devstral Small 1.0 (based on Mistral Small 3.1). It wasn't "vibe coding" so much as a "co-op" where i'd go back and forth with a chat interface (koboldcpp): i'd ask the LLM to implement some feature, then switch to the editor and start writing code using that feature while the LLM was generating it in the background. Or, more often, i'd fix bugs in the LLM's code :-P.
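For reference, a minimal llama-cpp-python sketch of this kind of setup (i actually drive it through koboldcpp's chat UI; the GGUF filename and settings below are placeholders, and an AMD card needs a ROCm or Vulkan build):

    # Minimal llama-cpp-python sketch; filename and settings are placeholders.
    # An AMD card like the RX 7900 XTX needs a ROCm (HIP) or Vulkan build.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Devstral-Small-Q4_K_M.gguf",  # placeholder local GGUF path
        n_gpu_layers=-1,                          # offload all layers to the GPU
        n_ctx=8192,                               # context window
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Write a C function that intersects a ray with a sphere."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])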

FWIW, GPU aside, my PC isn't particularly new - it is a 5-6 year old PC that was originally the cheapest money could buy and became "decent" when i upgraded it ~5 years ago; i only added the GPU around Christmas, as prices were dropping since AMD was about to release its new GPUs.

[0] https://i.imgur.com/FevOm0o.png

[1] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92...

15. throwaway0123_5 ◴[] No.44729356{3}[source]
iirc 4090s have 24GB
16. regularfry ◴[] No.44731634{3}[source]
Yes and no. The 4090 has 24GB, not 16; but with a big MoE you're not getting everything in there anyway. In that case you really want all the weights in RAM so that swapping experts in isn't a load from disk.

It's not as good as unified RAM, but it's also workable.