
326 points by threeturn | 3 comments

Dear Hackers, I’m interested in your real-world workflows for using open-source LLMs and open-source coding assistants on your laptop (not just cloud/enterprise SaaS). Specifically:

Which model(s) are you running (e.g., via Ollama, LM Studio, or others), and which open-source coding assistant/integration (for example, a VS Code plugin) are you using?

What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS), and how does it perform for your workflow?

What kinds of tasks do you use it for (code completion, refactoring, debugging, code review), and how reliable is it (what works well / where it falls short)?

I'm conducting my own investigation, which I'll be happy to share as well once it's complete.

Thanks! Andrea.

dust42 ◴[] No.45773793[source]
On a 64GB MacBook Pro I use a Q4 quant of Qwen3-Coder-30B-A3B with llama.cpp.

For VSCode I use continue.dev, as it lets me set my own (short) system prompt. I get around 50 tokens/sec generation and ~550 t/s prompt processing.
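
For reference, a minimal sketch of hitting that local llama-server from Python through its OpenAI-compatible endpoint (port 8080 is the default; the model filename and prompts are just placeholders):

    # Query a local llama.cpp server (llama-server) via its OpenAI-compatible API.
    # Assumes something like `llama-server -m qwen3-coder-30b-a3b-q4_k_m.gguf --port 8080`
    # is already running; the model filename is a placeholder.
    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system", "content": "You are a terse coding assistant."},
                {"role": "user", "content": "Write a Python function that reverses a string."},
            ],
            "temperature": 0.2,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])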

When given well-defined, small tasks, it is as good as any frontier model.

I like the speed, the low latency, and having it available on a plane/train or off-grid.

Also decent FIM (fill-in-the-middle) completion with the llama.cpp VSCode plugin.
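
For the curious, the FIM path looks roughly like this against llama-server's /infill endpoint (field names are from memory of the llama.cpp server docs, so double-check against your version):

    # Fill-in-the-middle sketch: ask the server to complete the gap between
    # a prefix and a suffix. Field names may differ across llama.cpp versions.
    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/infill",
        json={
            "input_prefix": "def fibonacci(n):\n    ",
            "input_suffix": "\n\nprint(fibonacci(10))\n",
            "n_predict": 64,
            "temperature": 0.1,
        },
        timeout=60,
    )
    print(resp.json()["content"])  # the completion that belongs between prefix and suffix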

If I need more intelligence, my personal favourites are Claude and DeepSeek via API.

replies(3): >>45774391 #>>45775753 #>>45777164 #
1. redblacktree ◴[] No.45774391[source]
Would you use a different quant with a 128 GB machine? Could you link the specific download you used on huggingface? I find a lot of the options there to be confusing.
replies(2): >>45775085 #>>45775717 #
2. tommy_axle ◴[] No.45775085[source]
2. tommy_axle ◴[] No.45775085[source]
Not the OP, but yes, you can definitely use a bigger quant like Q6 if it makes a difference, or go with a bigger-parameter model like gpt-oss-120B. A 70B would probably be great for a 128GB machine, though I don't think Qwen has one. You can search Hugging Face for the model you're interested in, often with "gguf" in the query, to get a ready-to-run build (e.g. https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/tree/main). Otherwise it's not a big deal to quantize yourself using llama.cpp.
3. dust42 ◴[] No.45775717[source]
I usually use unsloth quants, in this case https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-... - the Q4_K_M variant.

On 128GB I would definitely run a larger model, probably one with ~10B active parameters. It all depends on how many tokens per second are comfortable for you.

To get an idea of the speed difference, there is a benchmark page for llama.cpp on Apple silicon here: https://github.com/ggml-org/llama.cpp/discussions/4167

About quant selection: https://gist.github.com/Artefact2/b5f810600771265fc1e3944228...

And my workaround for 'shortening' prompt processing time: I load the files I want to work on (usually 1-3) into context with the instruction "read the code and wait". While the LLM is doing the prompt processing, I write out what I want done. Usually the LLM is finished with PP long before I'm done writing the instructions, and thanks to KV caching it then answers almost instantly.
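
A sketch of that pattern, assuming a local llama-server with its OpenAI-compatible API on the default port (file path and prompts are placeholders):

    # "Read the code and wait" trick: the second request repeats the first turn
    # verbatim, so the server's prefix KV cache is reused and the real answer
    # starts almost immediately.
    import pathlib
    import requests

    API = "http://127.0.0.1:8080/v1/chat/completions"
    code = pathlib.Path("src/parser.py").read_text()  # file(s) you want to work on

    def chat(messages):
        r = requests.post(API, json={"messages": messages, "temperature": 0.2}, timeout=300)
        return r.json()["choices"][0]["message"]["content"]

    # Turn 1: pay the prompt-processing cost up front, while you type your request.
    history = [{"role": "user", "content": f"Here is the code:\n\n{code}\n\nRead the code and wait."}]
    history.append({"role": "assistant", "content": chat(history)})

    # Turn 2: same prefix, so processing is (mostly) cached and the reply starts quickly.
    history.append({"role": "user", "content": "Refactor the tokenizer loop to use a generator."})
    print(chat(history))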