600 points antirez | 8 comments
dakiol ◴[] No.44625484[source]
> Gemini 2.5 PRO | Claude Opus 4

Whether it's vibe coding, agentic coding, or copy pasting from the web interface to your editor, it's still sad to see the normalization of private (i.e., paid) LLM models. I like the progress that LLMs introduce and I see them as a powerful tool, but I cannot understand how programmers (whether complete nobodies or popular figures) don't mind adding a strong dependency on a third party in order to keep programming. Programming used to be (and still is, to a large extent) an activity that can be done with open and free tools. I am afraid that in a few years, that will no longer be possible (as in most programmers will be so tied to a paid LLM that not using one would be like not using an IDE or vim nowadays), since everyone is using private LLMs. The excuse "but you earn six figures, what's $200/month to you?" doesn't really capture the issue here.

replies(46): >>44625521 #>>44625545 #>>44625564 #>>44625827 #>>44625858 #>>44625864 #>>44625902 #>>44625949 #>>44626014 #>>44626067 #>>44626198 #>>44626312 #>>44626378 #>>44626479 #>>44626511 #>>44626543 #>>44626556 #>>44626981 #>>44627197 #>>44627415 #>>44627574 #>>44627684 #>>44627879 #>>44628044 #>>44628982 #>>44629019 #>>44629132 #>>44629916 #>>44630173 #>>44630178 #>>44630270 #>>44630351 #>>44630576 #>>44630808 #>>44630939 #>>44631290 #>>44632110 #>>44632489 #>>44632790 #>>44632809 #>>44633267 #>>44633559 #>>44633756 #>>44634841 #>>44635028 #>>44636374 #
1. KronisLV ◴[] No.44633756[source]
The software is largely there: you can run Ollama, vLLM or whatever else you please today.
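
To make the "software is there" point concrete, here is a minimal sketch of talking to a local Ollama server over its HTTP API from Python. It assumes Ollama is listening on its default port (11434) and that a model has already been pulled. The model tag `qwen3:30b-a3b` and the helper names (`build_payload`, `tokens_per_second`, `generate`, `demo`) are illustrative assumptions, not something from this thread.

```python
import json
import urllib.request

# Default local endpoint for Ollama's generate API (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> bytes:
    """Encode a non-streaming request body for /api/generate."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Decode speed from Ollama's stats; eval_duration is in nanoseconds."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """POST the prompt to the local server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def demo() -> None:
    """Needs a running Ollama instance; the model tag is a placeholder."""
    reply = generate("qwen3:30b-a3b", "Explain MoE models in two sentences.")
    print(reply["response"])
    speed = tokens_per_second(reply["eval_count"], reply["eval_duration"])
    print(f"{speed:.1f} tokens/s")
```

Calling `demo()` against a running server prints the answer and the measured decode speed, which is a quick way to check what your own hardware manages.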

The models are somewhat getting there: even the smaller ones like Qwen3-30B-A3B and Devstral-24B are okay for some use cases and can run decently fast. They’re not amazing, but better than much larger models from a year or two ago.

The hardware is absolutely not there: most development laptops will be too weak to run a bunch of tools, IDEs and local services alongside an LLM and will struggle to do everything at the pace of those cloud services.

Even if you seek compromise and get a pair of Nvidia L4 cards or something similar and put them on a server somewhere, the aforementioned Qwen3-30B-A3B will run at around 60 tokens/second for a single query but slow down as you throw a bunch of developers at it that all need chat and autocomplete. The smaller Devstral model will start out at less than half that speed because it’s dense: every parameter is used for every token, whereas the MoE model only activates a small part of its weights.
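
One way to see why the dense model is slower: single-stream decoding tends to be limited by memory bandwidth, since the active weights have to be read once per generated token. The sketch below is a back-of-the-envelope upper bound, not a benchmark. All figures in it are assumptions for illustration (about 300 GB/s of memory bandwidth for one L4, about 3B active parameters for Qwen3-30B-A3B, about 24B for the dense Devstral model, 4-bit weights at 0.5 bytes per parameter). Real throughput will be lower, because this ignores KV cache reads, compute and multi-GPU overhead.

```python
def max_decode_tokens_per_second(
    active_params_billion: float,
    bytes_per_param: float,
    bandwidth_gb_per_s: float,
) -> float:
    """Bandwidth-bound ceiling: each token reads all active weights once."""
    if active_params_billion <= 0 or bytes_per_param <= 0:
        raise ValueError("parameter count and bytes per parameter must be positive")
    gigabytes_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_per_s / gigabytes_read_per_token


# Assumed figures, see the note above; none of them are measurements.
L4_BANDWIDTH_GB_S = 300.0
FOUR_BIT = 0.5  # bytes per parameter

moe_ceiling = max_decode_tokens_per_second(3, FOUR_BIT, L4_BANDWIDTH_GB_S)
dense_ceiling = max_decode_tokens_per_second(24, FOUR_BIT, L4_BANDWIDTH_GB_S)

print(f"MoE (3B active) ceiling:  {moe_ceiling:.0f} tokens/s")
print(f"Dense (24B) ceiling:      {dense_ceiling:.0f} tokens/s")
print(f"Ratio: {moe_ceiling / dense_ceiling:.0f}x")
```

The absolute numbers are optimistic ceilings, but the ratio is the interesting part: the dense model has to move roughly eight times as many bytes per token under these assumptions, which is in line with it being well under half as fast in practice.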

Tools like GitHub Copilot allow an Ollama connection pretty easily, Continue.dev also does but can be a bit buggy (their VS Code implementation is better than their JetBrains one), whereas the likes of RooCode only seem viable with cloud models because they generate large system prompts and need more performance than you can squeeze out of somewhat modest hardware.

That said, with more MoE models and better training, things seem hopeful. Just look at the recent ERNIE-4.5 release: their model is a bit smaller than Qwen3 but has largely comparable benchmark results.

Those Intel Arc Pro B60 cards can’t come soon enough. Someone needs to at least provide a passable alternative to Nvidia, nothing more.

replies(3): >>44633802 #>>44634098 #>>44635589 #
2. stingraycharles ◴[] No.44633802[source]
And models like Qwen3 really don’t match the quality of Opus 4 and Gemini 2.5 Pro. And even if you manage to get your hands on 512GB of GPU RAM, it will be slow.

There’s simply so much going on under the hood at these LLM providers that is very hard to replicate locally.

replies(1): >>44634366 #
3. amelius ◴[] No.44634098[source]
Didn't Karpathy, in his latest talk, say something along the lines of "don't bother with less capable models, they are just a waste of time"?
replies(1): >>44635102 #
4. justatdotin ◴[] No.44634366[source]
It's impossible to catch up, but there is still much fertile and prospective territory within reach.
5. loudmax ◴[] No.44635102[source]
It probably depends what your objective is. One of the benefits you get from running less capable models is that it's easier to understand what their limitations are. The shortcomings of more powerful models are harder to see and understand, because the models themselves are so much more capable.

If you have no interest in the inner workings of LLMs and you just want the machine to spit out some end result while putting in minimal time and effort, then yes, absolutely don't waste your time with smaller, less capable models.

replies(1): >>44637800 #
6. wizee ◴[] No.44635589[source]
On my M4 Max MacBook Pro, with MLX, I get around 70-100 tokens/sec for Qwen 3 30B-A3B (depending on context size), and around 40-50 tokens/sec for Qwen 3 14B. Of course they’re not as good as the latest big models (open or closed), but they’re still pretty decent for STEM tasks, and reasonably fast for me.

I have 128 GB RAM on my laptop, and regularly run multiple VMs, several heavy applications and many browser tabs alongside LLMs like Qwen 3 30B-A3B.

Of course there’s room for hardware to get better, but the Apple M4 Max is a pretty good platform for running local LLMs performantly on a laptop.

7. amelius ◴[] No.44637800{3}[source]
Is it really possible to learn from the mistakes of an LLM? It sounds like psychology or even alchemy 2.0, to be honest.
replies(1): >>44644491 #
8. theshrike79 ◴[] No.44644491{4}[source]
You can kinda get a feel for what they're good at, if you get what I mean?

Even the big online models have very specific styles and preferences for similar tasks. You can easily test this by giving them all some generic task without too many limits: each of them will gravitate towards a different solution to the same problem.
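
A rough sketch of that kind of side-by-side test is below. It sends one prompt to several models and collects the answers for comparison. The `generate` callable is a placeholder for whatever client you actually use (cloud API or local server), and the model names and `fake_generate` stand-in are made up for illustration.

```python
from typing import Callable, Dict, Iterable


def compare_models(
    prompt: str,
    models: Iterable[str],
    generate: Callable[[str, str], str],
) -> Dict[str, str]:
    """Send the same prompt to every model and key the answers by model name."""
    return {model: generate(model, prompt) for model in models}


def fake_generate(model: str, prompt: str) -> str:
    """Stand-in for a real client call so the sketch runs without any API."""
    return f"[{model}] answer to: {prompt}"


if __name__ == "__main__":
    task = "Write a function that deduplicates a list while keeping order."
    answers = compare_models(task, ["model-a", "model-b", "model-c"], fake_generate)
    for name, answer in answers.items():
        print(f"--- {name} ---")
        print(answer)
```

Swapping `fake_generate` for a real client and reading the outputs next to each other is usually enough to notice each model's habits on an open-ended task.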