577 points by simonw | 16 comments
1. joelthelion ◴[] No.44724227[source]
Apart from using a Mac, what can you use for inference with reasonable performance? Is a Mac the only realistic option at the moment?
replies(6): >>44724398 #>>44724419 #>>44724553 #>>44724563 #>>44724959 #>>44727049 #
2. AlexeyBrin ◴[] No.44724398[source]
A gaming PC with an NVIDIA 4090/5090 will be more than adequate for running local models.

Where a Mac may beat the above is on the memory side: if a model requires more than 24/32 GB of GPU memory, you are usually better off with a Mac with 64/128 GB of RAM. On a Mac the memory is shared between the CPU and GPU, so the GPU can load larger models.
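As a rough rule of thumb (ignoring KV cache and runtime overhead, so treat these as lower bounds), weight memory is parameter count times bytes per weight:

    # Rough estimate of weight memory at a given quantization.
    # Ignores KV cache and runtime overhead, so real usage is higher.
    def weight_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    for params, bits in [(8, 4), (32, 4), (70, 4), (70, 8)]:
        print(f"{params}B @ {bits}-bit ~= {weight_gb(params, bits):.0f} GB")
    # 8B  @ 4-bit ~=  4 GB -> fits easily on a 24 GB card
    # 32B @ 4-bit ~= 16 GB -> tight but doable in 24 GB
    # 70B @ 4-bit ~= 35 GB -> needs 48 GB+ of VRAM or a high-RAM Mac
    # 70B @ 8-bit ~= 70 GB -> 128 GB Mac territory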

3. reilly3000 ◴[] No.44724419[source]
The top 3 approaches I see a lot on r/localllama are:

1. 2-4x NVIDIA 3090s or better; some are getting Chinese 48GB cards. The VRAM ceiling keeps the very biggest models from loading, but most setups can run most quants at great speeds.

2. Epyc servers running CPU inference with lots of RAM and as much memory bandwidth as is available. With these setups people are getting something like 5-10 t/s, but are able to run 450B-parameter models (some rough bandwidth math is sketched after this list).

3. High-RAM Macs with as much memory bandwidth as possible. They are the most balanced approach and surprisingly reasonable relative to the other options.
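A back-of-the-envelope sketch of that bandwidth math: at batch size 1, decoding roughly reads every active weight once per token, so tokens/s is at most bandwidth divided by active weight bytes. The bandwidth figures below are ballpark, not measurements, and real numbers come in lower due to overhead:

    # Back-of-the-envelope: tokens/s ~= memory bandwidth / bytes of active weights.
    # Bandwidth figures are ballpark; real-world numbers are lower due to overhead.
    def tokens_per_sec(bandwidth_gb_s: float, active_params_b: float, bits: float) -> float:
        bytes_per_token = active_params_b * 1e9 * bits / 8
        return bandwidth_gb_s * 1e9 / bytes_per_token

    # A big MoE with ~37B active parameters at 4-bit behaves like a ~37B dense model here.
    for name, bw in [("Epyc CPU ~400 GB/s", 400),
                     ("RTX 3090 ~936 GB/s", 936),
                     ("M-series Ultra ~800 GB/s", 800)]:
        print(f"{name}: ~{tokens_per_sec(bw, 37, 4):.0f} t/s upper bound")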

4. thenaturalist ◴[] No.44724553[source]
This guy [0] does a ton of in-depth HW comparison/benchmarking, including against Mac mini clusters and an M3 Ultra.

0: https://www.youtube.com/@AZisk

5. regularfry ◴[] No.44724563[source]
This one should just about fit on a box with an RTX 4090 and 64GB of RAM (which is what I've got) at q4. I don't know what the performance will be yet. I'm hoping for an Unsloth dynamic quant to get the most out of it.
replies(1): >>44725469 #
6. whimsicalism ◴[] No.44724959[source]
You are almost certainly better off renting GPUs, but I understand self-hosting is an HN touchstone.
replies(2): >>44725021 #>>44725699 #
7. qingcharles ◴[] No.44725021[source]
This. Especially if you just want to try a bunch of different things out. Renting is insanely cheap -- to the point where I don't understand how the people renting the hardware out are making their money back unless they stole the hardware and power.

It can really help you figure a ton of things out before you blow the cash on your own hardware.

replies(1): >>44725157 #
8. 4b11b4 ◴[] No.44725157{3}[source]
Recommended sites to rent from?
replies(2): >>44725244 #>>44725337 #
9. doormatt ◴[] No.44725244{4}[source]
runpod.io
10. whimsicalism ◴[] No.44725337{4}[source]
Runpod, Vast, Hyperbolic, Prime Intellect. If all you're going to be doing is running LLMs, you can pay per token on OpenRouter or via some of the providers listed there.
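For the pay-per-token route, a minimal sketch against OpenRouter's OpenAI-compatible endpoint (the model id is purely illustrative):

    # Pay-per-token via OpenRouter's OpenAI-compatible API.
    # The model id below is illustrative; pick anything listed on openrouter.ai.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_API_KEY",
    )
    resp = client.chat.completions.create(
        model="qwen/qwen3-coder",  # illustrative
        messages=[{"role": "user", "content": "Hello from a rented token."}],
    )
    print(resp.choices[0].message.content)
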
11. weberer ◴[] No.44725469[source]
What's important is VRAM, not system RAM. The 4090 has 16GB of VRAM, so you'll be limited to smaller models at decent speeds. Of course, you can run models from system memory, but your tokens/second will be orders of magnitude slower. ARM Macs are the exception, since they have unified memory, allowing high bandwidth between the GPU and the system's RAM.
replies(2): >>44729356 #>>44731634 #
12. mrinterweb ◴[] No.44725699[source]
I don't know about that. I've had my RTX 4090 for nearly 3 years now. Suppose I had a script that provisioned and deprovisioned a rented 4090 at $0.70/hr for an 8-hour work day, 20 work days per month, assuming 2 paid weeks off per year plus normal holidays, over 3 years:

0.7 * 8 * ((20 * 12) - 8 - 14) * 3 = $3662

I bought my RTX 4090 for about $2200. I also had the pleasure of being able to use it for gaming when I wasn't working. To be fair, the VRAM requirements for local models keep climbing and my 4090 isn't able to run many of the latest LLMs. Also, I omitted the cost of electricity for the local LLM server; I haven't been measuring the total watts consumed by just that machine.
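The same assumed numbers as a tiny script (these are the assumptions above, not measured costs, and electricity is still excluded):

    # Break-even sketch using the assumed numbers above (not measured costs).
    RENTAL_RATE = 0.70                       # $/hr for a rented 4090
    HOURS_PER_DAY = 8
    WORKDAYS_PER_YEAR = 20 * 12 - 8 - 14     # minus holidays and time off
    PURCHASE_PRICE = 2200                    # what the card cost up front

    rental_per_year = RENTAL_RATE * HOURS_PER_DAY * WORKDAYS_PER_YEAR
    print(f"rental per year:  ${rental_per_year:,.0f}")                       # ~$1,221
    print(f"rental over 3 yr: ${rental_per_year * 3:,.0f}")                   # ~$3,662
    print(f"break-even at:    {PURCHASE_PRICE / rental_per_year:.1f} years")  # ~1.8 years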

One nice thing about renting is that it gives you flexibility in terms of what you want to try.

If you're really looking for the best deals, look at third-party hosts serving open models with API-based per-token pricing; or honestly, a Claude subscription can easily be worth it if you use LLMs a fair bit.

replies(1): >>44725791 #
13. whimsicalism ◴[] No.44725791{3}[source]
1. I agree - there are absolutely scenarios in which it can make sense to buy a GPU and run it yourself. If you are managing a software firm with multiple employees, you very well might break even in less than a few years. But I would wager this is not the case for 90%+ of people self-hosting these models, unless they have some other good reason (like gaming) to buy a GPU.

2. I basically agree with your caveats - excluding electricity is a pretty big exclusion, and I don't think you've had 3 years of really high-value self-hostable models; I would really only say the last year, and I'm somewhat skeptical of how good the ones that fit in 24GB of VRAM can be. 4x 4090 is a different story.

14. badsectoracula ◴[] No.44727049[source]
An Nvidia GPU is the most common answer, but personally i've done all my LLM use locally using mainly Mistral Small 3.1/3.2-based models and llama.cpp with an AMD RX 7900 XTX GPU. It only gives you ~4.71 tokens per second, but that is fast enough for a lot of uses. For example, last month or so i wrote a raytracer[0][1] in C with Devstral Small 1.0 (based on Mistral Small 3.1). It wasn't "vibe coding" so much as a "co-op" where i'd go back and forth with a chat interface (koboldcpp): i'd ask the LLM to implement some feature, then switch to the editor and start writing code using that feature while the LLM was generating it in the background. Or, more often, i'd fix bugs in the LLM's code :-P.
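For reference, a minimal llama-cpp-python sketch of this kind of setup (i actually drive it through koboldcpp's chat UI; the GGUF filename and settings below are placeholders, and an AMD card needs a ROCm or Vulkan build):

    # Minimal llama-cpp-python sketch; filename and settings are placeholders.
    # An AMD card like the RX 7900 XTX needs a ROCm (HIP) or Vulkan build.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Devstral-Small-Q4_K_M.gguf",  # placeholder local GGUF path
        n_gpu_layers=-1,                          # offload all layers to the GPU
        n_ctx=8192,                               # context window
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Write a C function that intersects a ray with a sphere."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])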

FWIW, GPU aside, my PC isn't particularly new - it is a 5-6 year old PC that was originally the cheapest money could buy and became "decent" when i upgraded it ~5 years ago; i only added the GPU around Christmas, as prices were dropping since AMD was about to release its new GPUs.

[0] https://i.imgur.com/FevOm0o.png

[1] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92...

15. throwaway0123_5 ◴[] No.44729356{3}[source]
iirc 4090s have 24GB
16. regularfry ◴[] No.44731634{3}[source]
Yes and no. The 4090 has 24GB, not 16; but with a big MoE you're not getting everything in there anyway. In that case you really want all the weights in RAM so that swapping experts in isn't a load from disk.

It's not as good as unified RAM, but it's also workable.