
600 points | antirez | 1 comment
dakiol No.44625484
> Gemini 2.5 PRO | Claude Opus 4

Whether it's vibe coding, agentic coding, or copy-pasting from the web interface into your editor, it's still sad to see the normalization of private (i.e., paid) LLM models. I like the progress that LLMs introduce and I see them as a powerful tool, but I cannot understand how programmers (whether complete nobodies or popular figures) don't mind adding a strong dependency on a third party in order to keep programming. Programming used to be (and still is, to a large extent) an activity that can be done with open and free tools. I am afraid that in a few years that will no longer be possible (as in: most programmers will be so tied to a paid LLM that not using one would be like not using an IDE or vim today), since everyone is using private LLMs. The excuse "but you earn six figures, what's $200/month to you?" doesn't really capture the issue here.

KronisLV No.44633756
The software is largely there: you can run Ollama, vLLM or whatever else you please today.

The models are somewhat getting there: even the smaller ones like Qwen3-30B-A3B and Devstral-23B are okay for some use cases and can run decently fast. They’re not amazing, but better than much larger models a year or two ago.

The hardware is absolutely not there: most development laptops will be too weak to run a bunch of tools, IDEs and local services alongside an LLM, and will struggle to keep up with the pace of those cloud services.

Even if you seek a compromise and put a pair of Nvidia L4 cards (or something similar) in a server somewhere, the aforementioned Qwen3-30B-A3B will run at around 60 tokens/second for a single query, but it will slow down as you throw a bunch of developers at it who all need chat and autocomplete. The smaller Devstral model will more than halve that starting performance because it's dense: every parameter is active on every token, versus only ~3B active parameters for the MoE model.
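A back-of-envelope model makes the MoE/dense gap concrete: single-stream decode speed is roughly bounded by memory bandwidth divided by the bytes of weights read per generated token. All the figures below (per-card bandwidth, parameter counts, 8-bit weights) are illustrative assumptions, not measurements:

```python
# Sketch: why an MoE model decodes much faster than a dense one of similar
# size. Decode is roughly memory-bandwidth bound: each generated token
# streams all *active* weights through the GPU. Numbers are assumptions.

def decode_tok_per_s(active_params_b: float, bytes_per_param: float,
                     mem_bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/sec = bandwidth / GB of active weights per token."""
    gb_per_token = active_params_b * bytes_per_param
    return mem_bandwidth_gb_s / gb_per_token

BW = 2 * 300.0  # assumed: two L4-class cards at ~300 GB/s each

# Qwen3-30B-A3B (MoE): only ~3B of 30B params are active per token.
moe = decode_tok_per_s(active_params_b=3, bytes_per_param=1.0,
                       mem_bandwidth_gb_s=BW)
# Devstral (~24B dense): all parameters are active on every token.
dense = decode_tok_per_s(active_params_b=24, bytes_per_param=1.0,
                         mem_bandwidth_gb_s=BW)

print(f"MoE upper bound:   {moe:.0f} tok/s")    # 200 tok/s
print(f"Dense upper bound: {dense:.0f} tok/s")  # 25 tok/s
```

Real systems land well under these upper bounds (hence the observed ~60 tok/s), but the ratio between the two explains why the dense model "more than halves" throughput.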

Tools like GitHub Copilot allow an Ollama connection pretty easily. Continue.dev does too, though it can be a bit buggy (their VS Code implementation is better than their JetBrains one). The likes of RooCode, on the other hand, only seem viable with cloud models, because they generate large system prompts and need more performance than you can squeeze out of somewhat modest hardware.

That said, with more MoE models and better training, things seem hopeful. Just look at the recent ERNIE-4.5 release, their model is a bit smaller than Qwen3 but has largely comparable benchmark results.

Those Intel Arc Pro B60 cards can’t come soon enough. Someone needs to at least provide a passable alternative to Nvidia, nothing more.

wizee No.44635589
On my M4 Max MacBook Pro, with MLX, I get around 70-100 tokens/sec for Qwen 3 30B-A3B (depending on context size), and around 40-50 tokens/sec for Qwen 3 14B. Of course they’re not as good as the latest big models (open or closed), but they’re still pretty decent for STEM tasks, and reasonably fast for me.

I have 128 GB of RAM on my laptop, and regularly run multiple VMs, several heavy applications and many browser tabs alongside LLMs like Qwen 3 30B-A3B.
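A rough sanity check on why that fits: the weights of a quantized 30B model alone need params × bits / 8 gigabytes, leaving plenty of headroom in 128 GB of unified memory (a sketch with assumed quantization levels; KV cache and runtime overhead come on top):

```python
# Rough weight-memory footprint for quantized local models.
# Quantization levels below are illustrative assumptions, not measurements.

def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """GB needed for the weights alone: params * bits / 8, ignoring overhead."""
    return params_billions * bits_per_param / 8

# Qwen3-30B-A3B: all 30B params must be resident in memory,
# even though only ~3B are active per generated token.
for bits in (4, 8, 16):
    print(f"30B model @ {bits}-bit: ~{weights_gb(30, bits):.0f} GB")
# 4-bit: ~15 GB, 8-bit: ~30 GB, 16-bit: ~60 GB
```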

Of course there’s room for hardware to get better, but the Apple M4 Max is a pretty good platform for running local LLMs performantly on a laptop.