Most active commenters

fm2606(5)
embedding-shape(3)

Popular/hot comments

>>45775404 #

Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop

Dear Hackers, I’m interested in your real-world workflows for using open-source LLMs and open-source coding assistants on your laptop (not just cloud/enterprise SaaS). Specifically:

Which model(s) are you running (e.g., Ollama, LM Studio, or others) and which open-source coding assistant/integration (for example, a VS Code plugin) you’re using?

What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?

What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).

I'm conducting my own investigation, which I will be happy to share as well when over.

Thanks! Andrea.

Show context

lreeves ◴[31 Oct 25 15:13 UTC] No.45772938[source]▶

>>45771870 (OP) #

I sometimes still code with a local LLM but can't imagine doing it on a laptop. I have a server that has GPUs and runs llama.cpp behind llama-swap (letting me switch between models quickly). The best local coding setup I've been able to do so far is using Aider with gpt-oss-120b.

I guess you could get a Ryzen AI Max+ with 128GB RAM to try and do that locally but non-nVidia hardware is incredibly slow for coding usage since the prompts become very large and take exponentially longer but gpt-oss is a sparse model so maybe it won't be that bad.

Also just to point it out, if you use OpenRouter with things like Aider or roocode or whatever you can also flag your account to only use providers with a zero-data retention policy if you are truly concerned about anyone training on your source code. GPT5 and Claude are infinitely better, faster and cheaper than anything I can do locally and I have a monster setup.

replies(2): >>45774585 #>>45775707 #

1. fm2606 ◴[31 Oct 25 17:38 UTC] No.45774585[source]▶

>>45772938 #

gpt-oss-120b is amazing. I created a RAG agent to hold most of GCP documentation (separate download, parsing, chunking, etc). ChatGPT finished a 50 question quiz in 6 min with a score of 46 / 50. gpt-oss-120b took over an hour but got 47 / 50. All the other local LLMs I tried were small and performed way worse, like less than 50% correct.

I ran this on an i7 with 64gb of RAM and an old nvidia card with 8g of vram.

EDIT: Forgot to say what the RAG system was doing which was answering a 50 question multiple choice test about GCP and cloud engineering.

replies(8): >>45774966 #>>45775404 #>>45775557 #>>45777956 #>>45778679 #>>45779534 #>>45781600 #>>45783342 #

2. embedding-shape ◴[31 Oct 25 18:13 UTC] No.45774966[source]▶

>>45774585 (TP) #

> gpt-oss-120b is amazing

Yup, I agree, easily best local model you can run today on local hardware, especially when reasoning_effort is set to "high", but "medium" does very well too.

I think people missed out on how great it was because a bunch of the runners botched their implementations at launch, and it wasn't until 2-3 weeks after launch that you could properly evaluate it, and once I could run the evaluations myself on my own tasks, it really became evident how much better it is.

If you haven't tried it yet, or you tried it very early after the release, do yourself a favor and try it again with updated runners.

3. lacoolj ◴[31 Oct 25 18:54 UTC] No.45775404[source]▶

>>45774585 (TP) #

you can run the 120b model on an 8GB GPU? or are you running this on CPU with the 64GB RAM?

I'm about to try this out lol

The 20b model is not great, so I'm hoping 120b is the golden ticket.

replies(4): >>45775537 #>>45775558 #>>45777215 #>>45777455 #

4. fm2606 ◴[31 Oct 25 19:06 UTC] No.45775537[source]▶

>>45775404 #

Hmmm...now that you say that, it might have been the 20b model.

And like a dumbass I accidentally deleted the directory and didn't have a back up or under version control.

Either way, I do know for a fact that the gpt-oss-XXb model beat chatgpt by 1 answer and it was 46/50 at 6 minutes and 47/50 at 1+ hour. I remember because I was blown away that I could get that type of result running locally and I had texted a friend about it.

I was really impressed but disappointed at the huge disparity between time the two.

5. ◴[31 Oct 25 19:08 UTC] No.45775557[source]▶

>>45774585 (TP) #

6. fm2606 ◴[31 Oct 25 19:08 UTC] No.45775558[source]▶

>>45775404 #

Everything I run, even the small models, some amount goes to the GPU and the rest to RAM.

7. gunalx ◴[31 Oct 25 22:07 UTC] No.45777215[source]▶

>>45775404 #

I have in many cases had better results with the 20b model, over the 120b model. Mostly because it is faster and I can iterate prompts quicker to choerce it to follow instructions.

replies(1): >>45781257 #

8. ThatPlayer ◴[31 Oct 25 22:36 UTC] No.45777455[source]▶

>>45775404 #

With MoE models like gpt-oss, you can run some layers on the CPU (and some on GPU): https://github.com/ggml-org/llama.cpp/discussions/15396

Mentions 120b is runnable on 8GB VRAM too: "Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too"

9. rovr138 ◴[31 Oct 25 23:47 UTC] No.45777956[source]▶

>>45774585 (TP) #

> I created a RAG agent to hold most of GCP documentation (separate download, parsing, chunking, etc)

If you share the scripts to gather the GCP documentation this, that'd be great. Because I have had an idea to do something like this, and the part I don't want to deal with is getting the data

replies(1): >>45786605 #

10. whatreason ◴[01 Nov 25 02:02 UTC] No.45778679[source]▶

>>45774585 (TP) #

What do you use to run gpt-oss here? ollama, vLLM, etc

replies(1): >>45781029 #

11. adastra22 ◴[01 Nov 25 05:54 UTC] No.45779534[source]▶

>>45774585 (TP) #

What quantization settings?

12. embedding-shape ◴[01 Nov 25 12:09 UTC] No.45781029[source]▶

>>45778679 #

Not parent, but frequent user of GPT-OSS, tried all different ways of running it. Choice goes something like this:

- Need batching + highest total throughoutput? vLLM, complicated to deploy and install though, need special versions for top performance with GPT-OSS

- Easiest to manage + fast enough: llama.cpp, easier to deploy as well (just a binary) and super fast, getting ~260 tok/s on a RTX Pro 6000 for the 20B version

- Easiest for people not used to running shell commands or need a GUI and don't care much for performance: Ollama

Then if you really wanna go fast, try to get TensorRT running on your setup, and I think that's pretty much the fastest GPT-OSS can go currently.

13. embedding-shape ◴[01 Nov 25 12:53 UTC] No.45781257{3}[source]▶

>>45777215 #

> had better results with the 20b model, over the 120b model

The difference of quality and accuracy of the responses between the two is vastly different though, if tok/s isn't your biggest priority, especially when using reasoning_effort "high". 20B works great for small-ish text summarization and title generation, but for even moderately difficult programming tasks, 20B fails repeatedly while 120B gets it right on the first try.

14. giorgioz ◴[01 Nov 25 13:45 UTC] No.45781600[source]▶

>>45774585 (TP) #

on what hardware you manate to run gpt-oss-120b locally?

15. gkfasdfasdf ◴[01 Nov 25 17:09 UTC] No.45783342[source]▶

>>45774585 (TP) #

What were you using for RAG? Did you build your own or some off the shelf solution (e.g. openwebui)

replies(1): >>45786666 #

16. fm2606 ◴[01 Nov 25 23:52 UTC] No.45786605[source]▶

>>45777956 #

I tried scripts but got blocked. I used wget to download tthem

17. fm2606 ◴[02 Nov 25 00:00 UTC] No.45786666[source]▶

>>45783342 #

I used pg vector chunking on paragraphs. For the answers I saved in a flat text file and then parsed to what I needed.

For parsing and vectorizing of the GCP docs I used a Python script. For reading each quiz question, getting a text embedding and submitting to an LLM, I used Spring AI.

It was all roll your own.

But like I stated in my original post I deleted it without backup or vcs. It was the wrong directory that I deleted. Rookie mistake for which I know better.

↑