
600 points antirez | 48 comments | HN request time: 0.002s
dakiol ◴[] No.44625484[source]
> Gemini 2.5 PRO | Claude Opus 4

Whether it's vibe coding, agentic coding, or copy-pasting from the web interface to your editor, it's still sad to see the normalization of private (i.e., paid) LLM models. I like the progress that LLMs introduce and I see them as a powerful tool, but I cannot understand how programmers (whether complete nobodies or popular figures) don't mind adding a strong dependency on a third party in order to keep programming. Programming used to be (and still is, to a large extent) an activity that can be done with open and free tools. I am afraid that in a few years that will no longer be possible (as in: most programmers will be so tied to a paid LLM that not using one would be like not using an IDE or vim today), since everyone is using private LLMs. The excuse "but you earn six figures, what's $200/month to you?" doesn't really capture the issue here.

replies(46): >>44625521 #>>44625545 #>>44625564 #>>44625827 #>>44625858 #>>44625864 #>>44625902 #>>44625949 #>>44626014 #>>44626067 #>>44626198 #>>44626312 #>>44626378 #>>44626479 #>>44626511 #>>44626543 #>>44626556 #>>44626981 #>>44627197 #>>44627415 #>>44627574 #>>44627684 #>>44627879 #>>44628044 #>>44628982 #>>44629019 #>>44629132 #>>44629916 #>>44630173 #>>44630178 #>>44630270 #>>44630351 #>>44630576 #>>44630808 #>>44630939 #>>44631290 #>>44632110 #>>44632489 #>>44632790 #>>44632809 #>>44633267 #>>44633559 #>>44633756 #>>44634841 #>>44635028 #>>44636374 #
1. simonw ◴[] No.44626556[source]
The models I can run locally aren't as good yet, and are way more expensive to operate.

Once it becomes economical to run a Claude 4 class model locally you'll see a lot more people doing that.

The closest you can get right now might be Kimi K2 on a pair of 512GB Mac Studios, at a cost of about $20,000.

replies(12): >>44627184 #>>44627617 #>>44627695 #>>44627852 #>>44628143 #>>44631034 #>>44631098 #>>44631352 #>>44631995 #>>44632684 #>>44633226 #>>44644288 #
2. ◴[] No.44627184[source]
3. QRY ◴[] No.44627617[source]
Have you considered the Framework Desktop setup they mentioned in their announcement blog post[0]? Just marketing fluff, or is there any merit to it?

> The top-end Ryzen AI Max+ 395 configuration with 128GB of memory starts at just $1999 USD. This is excellent for gaming, but it is a truly wild value proposition for AI workloads. Local AI inference has been heavily restricted to date by the limited memory capacity and high prices of consumer and workstation graphics cards. With Framework Desktop, you can run giant, capable models like Llama 3.3 70B Q6 at real-time conversational speed right on your desk. With USB4 and 5Gbit Ethernet networking, you can connect multiple systems or Mainboards to run even larger models like the full DeepSeek R1 671B.

I'm futzing around with setups, but adding up the specs would give 384GB of VRAM and 512GB of total memory, at a cost of about $10,000-$12,000. This is all highly dubious napkin math, and I hope to see more experimentation in this space.

There's of course the moving target of cloud costs and performance, so analysing break-even time is even more precarious. So if this sort of setup would work, its cost-effectiveness is a mystery to me.

[0] https://frame.work/be/en/blog/introducing-the-framework-desk...
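For what it's worth, the fits-or-doesn't part of the napkin math is easy to script. A rough sketch; the bytes-per-parameter arithmetic is standard, but the ~10% overhead margin for KV cache and runtime buffers is my assumption:

```python
# Rough memory footprint for dense models at various quantizations.
# Assumption: bytes ≈ params * (bits / 8), plus ~10% overhead for
# KV cache and runtime buffers.

def model_footprint_gb(params_b: float, bits: int, overhead: float = 0.10) -> float:
    """Approximate GB needed to load a params_b-billion-parameter dense model."""
    return params_b * (bits / 8) * (1 + overhead)

for name, params in [("Llama 3.3 70B", 70), ("DeepSeek R1 671B", 671)]:
    for bits in (4, 6, 8):
        gb = model_footprint_gb(params, bits)
        verdict = "fits in" if gb <= 128 else "exceeds"
        print(f"{name} @ Q{bits}: ~{gb:.0f} GB ({verdict} a 128 GB box)")
```

By this math a 70B Q6 squeezes into one 128GB machine, while the full R1 needs several networked together, which matches Framework's own pitch.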

replies(6): >>44627826 #>>44628517 #>>44629688 #>>44629702 #>>44631163 #>>44632389 #
4. zer00eyz ◴[] No.44627695[source]
> Once it becomes economical to run a Claude 4 class model locally you'll see a lot more people doing that.

Historically these sorts of things happened because of Moore's law. Moore's law is dead. For a while we scaled on the back of "more cores" and process shrink. It looks like we've hit the wall again.

We seem to be near the limit of scaling (physics): we're not seeing a lot in clock (some, but not enough), and IPC is flat. We are also having power (density) and cooling (air won't cut it any more) issues.

The requirements to run something like Claude 4 locally aren't going to make it to household consumers any time soon. Simply put, the very top end of consumer PCs looks like 10-year-old server hardware, and very few people are running that because there isn't a need.

The only way we're going to see better models locally is if there is work (research, engineering) put into it. To be blunt, that isn't really happening, because FB/MS/Google are scaling in the only way they know how: throw money at it to capture and dominate the market, lock the innovators out of your API, and then milk the consumer however you can. Smaller and local is antithetical to this business model.

Hoping for the innovation that gives you a moat, that makes you the next IBM, isn't the best way to run a business.

Based on how often Google cancels projects, and how often the things Zuck swears are "next" face-plant (the metaverse), one should not have a lot of hope about AI.

replies(3): >>44627840 #>>44628024 #>>44630780 #
5. cheeze ◴[] No.44627826[source]
I love Framework but it's still not enough IMO. My time is the most valuable thing, and a subscription to $paid_llm_of_choice is _cheap_ relative to my time spent working.

In my experience, something like Llama 3.3 works really well for smaller tasks. For "I'm lazy and want to provide minimal prompting for you to build a tool similar to what is in this software package already", paid LLMs are king.

If anything, I think the best approach for free LLMs would be to run on rented GPU capacity. I feel bad knowing that I have a 4070 Ti Super that sits idle 95% of the time. I'd rather share an a1000 with a bunch of folks and have it run at close to max utilization.

replies(2): >>44628162 #>>44633970 #
6. esafak ◴[] No.44627840[source]
Model efficiency is outpacing Moore's law. That's what DeepSeek V3 was about. It's just that we're simultaneously finding ways to increase model capacity, and that's growing even faster...
replies(1): >>44628211 #
7. smallerize ◴[] No.44627852[source]
I don't have to actually run it locally to remove lock-in. Several cloud providers offer full DeepSeek R1 or Kimi K2 for $2-3/million output tokens.
replies(1): >>44629572 #
8. mleo ◴[] No.44628024[source]
Why wouldn’t 3rd-party hardware vendors continue to work on reducing the costs of running models locally? If there is a market opportunity for someone to make money, it will be filled. Even if the cloud vendors don’t develop the hardware, someone else will. Apple has a vested interest in making hardware to run better models locally, for example.
replies(1): >>44629088 #
9. oblio ◴[] No.44628143[source]
The thing is, code is quite compact. Why do LLMs need to train on content bigger than the size of the textual internet to be effective?

Total newb here.

replies(1): >>44628343 #
10. generic92034 ◴[] No.44628162{3}[source]
> and a subscription to $paid_llm_of_choice is _cheap_ relative to my time spent working.

In the mid to long term, the question is whether the subscription covers the costs of the LLM provider. Current prices might not be stable for long.

replies(1): >>44638566 #
11. zer00eyz ◴[] No.44628211{3}[source]
> Model efficiency is outpacing Moore's law.

Moore's law is dead, and has been for a long time. There is nothing to outpace.

> That's what DeepSeek V3 was about.

This would be a foundational shift! What problem in complexity theory was solved that the rest of computing missed out on?

Don't get me wrong, MoE is very interesting, but breaking up one large model into independent chunks isn't a foundational breakthrough; it's basic architecture. It's 1960s time-sharing basics. It's decomposition-of-your-application basics.

All that having been said, there is a ton of room for these sorts of basic, blood-and-guts engineering ideas to make systems more "portable" and "usable". But a shift in thinking toward small, targeted, and focused will have to happen. That's antithetical to "everything in one basket, throw more compute at it, and magically we will get to AGI". That clearly isn't the direction the industry is going... it won't give anyone a moat or market dominance.

replies(2): >>44628648 #>>44629754 #
12. airspresso ◴[] No.44628343[source]
Many reasons, one being that LLMs are essentially compressing the training data to unbelievably small data volumes (the weights). When doing so, they can only afford to keep the general principles and semantic meaning of the training data. Bigger models can memorize more than smaller ones of course, but are still heavily storage limited. Through this process they become really good at semantic understanding of code and language in general. It takes a certain scale of training data to achieve that.
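To put rough numbers on that compression claim; the ~15T-token corpus size and ~4 bytes of raw text per token are round-number assumptions on my part, not published figures:

```python
# Napkin math: model weights as lossy compression of the training corpus.
# Assumptions: ~15T training tokens, ~4 bytes of raw text per token.

train_tokens = 15e12
raw_corpus_bytes = train_tokens * 4        # ~60 TB of text
weights_bytes = 70e9 * 2                   # 70B params at fp16 ≈ 140 GB

ratio = raw_corpus_bytes / weights_bytes
print(f"~{raw_corpus_bytes/1e12:.0f} TB of text into ~{weights_bytes/1e9:.0f} GB "
      f"of weights: roughly {ratio:.0f}x")
```

Hundreds-to-one lossy compression is why only general principles and semantics survive, not verbatim recall.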
replies(1): >>44629047 #
13. lhl ◴[] No.44628517[source]
Strix Halo does not run a 70B Q6 dense model at real-time conversational speed - it has a real-world MBW of about 210 GB/s. A 40GB Q4 will clock just over 5 tok/s. A Q6 would be slower.

It will run some big MoEs at a decent speed (eg, Llama 4 Scout 109B-A17B Q4 at almost 20 tok/s). The other issue is its prefill - only about 200 tok/s due to having only very under-optimized RDNA3 GEMMs. From my testing, you usually have to trade off pp for tg.

If you are willing to spend $10K for hardware, I'd say you are much better off w/ EPYC and 12-24 channels of DDR5, and a couple fast GPUS for shared experts and TFLOPS. But, unless you are doing all-night batch processing, that $10K is probably better spent on paying per token or even renting GPUs (especially when you take into account power).

Of course, there may be other reasons you'd want to inference locally (privacy, etc).
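Putting numbers on that prefill/generation tradeoff, using the figures above (~200 tok/s prefill, ~5 tok/s generation for a dense Q4 model) as inputs; the prompt and output sizes are made up but typical for a coding-assistant turn:

```python
# Wall-clock time for one request on bandwidth-limited local hardware:
# compute-bound prefill plus bandwidth-bound decode.

def request_seconds(prompt_tokens, output_tokens, prefill_tps=200, gen_tps=5):
    """Total seconds = prompt ingestion time + token generation time."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# A typical agentic-coding turn: a big context in, a modest diff out.
t = request_seconds(prompt_tokens=8000, output_tokens=500)
print(f"~{t:.0f} s per turn ({8000/200:.0f} s prefill + {500/5:.0f} s generation)")
```

A couple of minutes per turn is fine for chat but painful for an edit-compile-fix loop, which is the tradeoff being described.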

replies(1): >>44628610 #
14. moffkalast ◴[] No.44628610{3}[source]
Yeah, it's only really viable for chat use cases. Coding is the most demanding in terms of generation speed: to keep the workflow usable it needs to spit out corrections in seconds, not minutes.

I use local LLMs as much as possible myself, but coding is the only use case where I still entirely defer to Claude, GPT, etc., because you need both max speed and bleeding-edge model intelligence for anything close to acceptable results. When Qwen-3-Coder lands, having it on RunPod might be a viable low-end alternative, but likely still a major waste of time when you actually need to get something done properly.

15. moron4hire ◴[] No.44628648{4}[source]
I agree with you that Moore's Law being dead means we can't expect much more from current, silicon-based GPU compute. Any improvement from hardware alone is going to have to come from completely new compute technology, of which I don't think there is anything mature enough to expect any results in the next 10 years.

Right now, hardware wise, we need more RAM in GPUs than we really need compute. But it's a breakpoint issue: you need enough RAM to hold the model. More RAM that is less than the model is not going to improve things much. More RAM that is more than the model is largely dead weight.

I don't think larger models are going to show any major inference improvements. They hit the long tail of diminishing returns re: model training vs quality of output at least 2 years ago.

I think the best anyone can hope for in optimizing current LLM technology is improve the performance of inference engines, and there at most I can imagine only about a 5x improvement. That would be a really long tail of performance optimizations that would take at least a decade to achieve. In the 1 to 2 year timeline, I think the best that could be hoped for is a 2x improvement. But I think we may have already seen much of the low hanging optimization fruit already picked, and are starting to turn the curve into that long tail of incremental improvements.

I think everyone betting on LLMs improving the performance of junior to mid-level devs, and that leading to a renaissance of software development speed, is wildly overoptimistic about the total contribution to productivity those developers already represent. Most of the most important features are banged out by harried, highly skilled senior developers. Most everyone else is cleaning up around the edges of that. Even a 2 or 3x improvement in the bottom 10% of contributions is only going to grow the pie so much. And I think these tools are basically useless to skilled senior devs. All this "boilerplate" code folks keep cheering the AI is writing for them is just not that big of a deal. 15 minutes of savings once a month.

But I see how this technology works and what people are asking it to do (which in my company is basically "all the hard work that you already weren't doing, so how are you going to even instruct an LLM to do it if you don't really know how to do it?") and there is such a huge gap between the two that I think it's going to take at least a 100x improvement to get there.

I can't see AI being all that much of an improvement on productivity. It still gives wrong results too many times. The work needed to make it give good results is the same sort of work we should have been doing already to be able to leverage classical ML systems with more predictable performance and output. We're going to spend trillions as an industry chasing AI that will only end up being an exercise in making sure documents are stored in a coherent, searchable way. At which point, why not just do that and avoid having to pressure the energy industry into firing up a bunch of old coal plants to meet demand?

replies(1): >>44631767 #
16. oblio ◴[] No.44629047{3}[source]
Yeah, I just asked Gemini, and apparently some older estimates put a relatively filtered dataset of GitHub source code at around 21TB in 2018, and some more recent estimates put it in the low hundreds of TB.

Considering, as you said, that LLMs are doing a form of compression, and assuming generously that you add extra compression on top, yeah, now I understand a bit more. Even if you focus on non-similar code to get the most coverage, I wouldn't be shocked if a modern, representative source-code training set from GitHub weighed 1TB, which is obviously a lot more than consumer-grade hardware can bear.

I guess we need to ramp up RAM production a bunch more :-(

Speaking of which, what's the next bottleneck apart from storing the damned things? Training needs a ton of resources, but that part can be pooled; even for OSS models it "just" needs to be done "once", and then the entire community can use the result. So I guess inference is the scaling cost; what's the most used resource there? Data bandwidth for RAM?

replies(1): >>44652673 #
17. zer00eyz ◴[] No.44629088{3}[source]
> Why wouldn’t 3rd party hardware vendors continue to work on reducing costs of running models locally?

Everyone wants this to happen and they are all trying, but...

EUV, which has gotten us down to 3nm and less, is HARD. Reduction in chip size has led to increases in density and lower costs. But now yields are DOWN, and the design concessions to make the processes work are hurting costs and performance. There are a lot of hopes and prayers in the 1.8nm nodes, but things look grim.

Power is a massive problem for everyone. It is a MASSIVE problem IN the data center and it is a problem for GPUs at home. Considering that "locally" is a PHONE for most people, it's an even bigger problem. With all this power come cooling issues. The industry is starting to look at all sorts of interesting ways to move heat away from cores... ones that don't involve air.

Design has hit a wall as well. If you look at NVIDIA's latest offering, its IPC (that's Instructions Per Clock cycle) is flat. The only gains between the latest generation and the previous one have come from small frequency upticks. Those gains came from using "more power!!!", and that's a problem because...

Memory is a problem. There is a reason the chips for GPUs are soldered onto the boards next to the processors. There is a reason laptops have them soldered on too. CAMM tries to fix some of this, but the results are, to say the least, disappointing thus far.

All of this has been hitting CPUs slowly, but we have also had the luxury of "more cores" to throw at things. If you go back 10-15 years, a top-end server is about the same as a top-end desktop today (core count, single-core perf). Because of all of the above issues, I don't think you are going to get 700+ core consumer desktops in a decade (the current high end for server CPUs)... because of power, costs, etc.

Unless we see some foundational breakthrough in hardware (it could happen), you won't see the normal generational lift in performance that you have in the past (and I would argue that we already haven't been seeing that). Someone is going to have to make MAJOR investments in the software side, and there is NO MOAT in doing so. Simply put, it's a bad investment... and if we can't lower the cost of compute (and it looks like we can't), it's going to be hard for small players to get in and innovate.

It's likely you're seeing a very real wall.

18. ketzo ◴[] No.44629572[source]
In what ways is that better for you than using eg Claude? Aren’t you then just “locked in” to having a cloud provider which offers those models cheaply?
replies(1): >>44629709 #
19. zackify ◴[] No.44629688[source]
The memory bandwidth is crap, and you'll never run anything close to Claude on that, unfortunately. They should have shipped something 8x faster, at least 2 TB/s of bandwidth.
20. smcleod ◴[] No.44629702[source]
The Framework Desktop isn't really that compelling for work with LLMs; its memory bandwidth is very low compared to GPUs and Apple Silicon Max/Ultra chips. You'd really notice how slow LLMs are on it, to the point of frustration. Even a 2023 MacBook Pro with an M2 Max chip has twice the usable bandwidth.
21. viraptor ◴[] No.44629709{3}[source]
Any provider can run Kimi (including yourself if you would get enough use out of it), but only one can run Claude.
replies(1): >>44633739 #
22. viraptor ◴[] No.44629754{4}[source]
> What problem in complexity theory was solved

None. We're still in the "if you spend enough effort you can make things less bad" era of LLMs. It will be a while before we even find out what are the theoretical limits in that area. Everyone's still running on roughly the same architecture after all - big corps haven't even touched recursive LLMs yet!

23. Aurornis ◴[] No.44630780[source]
> We seem to be near the limit of scaling (physics) we're not seeing a lot in clock (some but not enough), and IPC is flat. We are also having power (density) and cooling (air wont cut it any more) issues.

This is exaggeration. CPUs are still getting faster. IPC is increasing, not flat. Cooling on air is fine unless you’re going for high density or low noise.

This is just cynicism. Even an M4 MacBook Pro is substantially faster than an M1 from a few years ago, which is substantially faster than the previous versions.

Server chips are scaling core counts and bandwidth. GPUs are getting faster and faster.

The only way you could conclude scaling is dead is if you ignored all recent progress or you’re expecting improvements at an unrealistically fast rate.

replies(1): >>44640912 #
24. NiloCK ◴[] No.44631034[source]
> Once it becomes economical to run a Claude 4 class model locally you'll see a lot more people doing that.

By that time Claude 5 (or whatever) will be available over API.

I am grateful for upward pressure from models with published binaries - I do believe this is fundamental floor-raising technology.

Choosing frontier-1 for the sake of privacy, autonomy, etc will always be a hard sell and only ever to a pretty niche market. Even me - I'm ideologically part of this market, but I'm already priced out hardware wise.

25. mythz ◴[] No.44631098[source]
Whilst it's not economically feasible to self-host, using premier OSS models like Kimi K2 / DeepSeek via OpenRouter gets you a great price, with the fallback safety net of being able to self-host should the proprietary model companies collude and try to squeeze more ROI out of us. Hopefully by then the hardware to run the OSS models will be a lot cheaper.
26. komali2 ◴[] No.44631163[source]
They demoed it live at Computex and it was slooooow. Like two characters a second slow. IIRC he had 4 machines clustered.
27. jmb99 ◴[] No.44631352[source]
What’s your budget and speed requirement? A quad-CPU Xeon E7 v4 server (a Supermicro X10QBI, for example) with 1TB of RAM gives you ~340GB/s of memory bandwidth and enough actual memory to host a full DeepSeek instance, but it will be relatively slow (a few tokens/s max in my experience). Up-front cost is a bit under $1k, less if you can source cheap 32GB DDR3 RAM. Power consumption is relatively high, ~1kW under load. But I don’t think you can self-host a large model cheaper than that.

(If you need even more memory you could equip one of those servers with 6TB of DDR3 but you’ll lose a bit of bandwidth if you go over 2TB. DDR4 is also a slightly faster option but you’re spending 4x as much for the same capacity.)
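For what it's worth, a few tokens/s is roughly what the bandwidth math predicts. A sketch, where the ~37B active-parameter figure for DeepSeek R1 and the Q4 quantization (~0.5 bytes/param) are my assumptions:

```python
# Upper bound on generation speed for a bandwidth-bound MoE model:
# each generated token must stream the active experts' weights from RAM once.

def max_tok_per_s(bandwidth_gb_s, active_params_b, bytes_per_param=0.5):
    """Theoretical decode ceiling: bandwidth / bytes read per token."""
    active_gb = active_params_b * bytes_per_param   # Q4 ≈ 0.5 bytes/param
    return bandwidth_gb_s / active_gb

# DeepSeek R1: 671B total params, ~37B active per token (MoE).
print(f"~{max_tok_per_s(340, 37):.0f} tok/s theoretical ceiling at 340 GB/s")
```

Real throughput lands well below that ceiling once NUMA hops and CPU compute limits bite, which squares with the few tok/s reported.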

replies(2): >>44631449 #>>44644474 #
28. pmarreck ◴[] No.44631449[source]
This would require massively more power than the Mac Studios.
replies(1): >>44637427 #
29. bluefirebrand ◴[] No.44631767{5}[source]
> And I think these tools are basically useless to skilled senior devs. All this "boilerplate" code folks keep cheering the AI is writing for them is just not that big of a deal. 15 minutes of savings once a month

Yep... Copy and paste with find and replace already had the boilerplate code covered

30. otabdeveloper4 ◴[] No.44631995[source]
a) Rent a GPU server. b) Learn to finetune your models. You're a programmer, right? Whatever happened to knowing your tools?

OP is right, these people are posers and fakers, not programmers.

replies(1): >>44633717 #
31. oblio ◴[] No.44632389[source]
One second, don't LLMs generally run in VRAM? If you put them in regular RAM, don't they have to go through the CPU which kills performance?
replies(1): >>44632409 #
32. pxeger1 ◴[] No.44632409{3}[source]
The mentioned CPU uses unified memory for its built-in GPU/NPU, i.e. some portion of what would ordinarily be system RAM is given to the GPU instead of the CPU.
replies(1): >>44640365 #
33. Abishek_Muthian ◴[] No.44632684[source]
What type of code do you write for which the open-source models aren't good enough?

I use Qwen2.5 coder for auto complete and occasional chat. I don't want AI to edit my code and so this works well for me.

I agree that the hardware investment for local AI is steep, but IMO the local models are good enough for most experienced coders who just want a better autocomplete than the one provided by the IDE by default.

replies(2): >>44632755 #>>44632804 #
34. arvinsim ◴[] No.44632755[source]
The people who are subscribing to private LLM models are not doing it for better autocomplete. These are the people who want more features like agents.
35. InterviewFrog ◴[] No.44632804[source]
Using AI for autocomplete is like using a racecar to pick up groceries. This is exactly the kind of ideological or psychological refusal to use LLMs that the author describes.
replies(1): >>44646215 #
36. poulpy123 ◴[] No.44633226[source]
Yeah. I cannot even run significantly worse models on any machine I have at home.
37. simonw ◴[] No.44633717[source]
Have you had any success finetuning models? What did you do?
replies(1): >>44646290 #
38. hhh ◴[] No.44633739{4}[source]
Two can run Claude: AWS and Anthropic. The Claude rollout on AWS is pretty good, but they do some weird stuff estimating your quota usage through your max_tokens parameter.

I trust AWS, but we also pay big bucks to them and have a reason to trust them.

replies(1): >>44641572 #
39. jmb99 ◴[] No.44637427{3}[source]
Yep, ~1kW as mentioned. Depending on your electrical rate, break even might be years down the line. And obviously the Mac Studios would perform substantially better.

Edit: Also, to get even half as much memory, you need to spend $10k. If you want to host actually-large LLMs (not quantized/distilled versions), you'll need to spend close to that much. Maybe you can get away with 256GB for now, but that won't even host full DeepSeek (and I don't know if 512GB will either, with OS/etc. overhead and a large context window).
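The break-even arithmetic is straightforward; the $0.15/kWh electricity price, the ~0.9kW extra draw versus a Mac Studio, and the $9k hardware price gap are all my assumptions:

```python
# When does a pricier low-power box (e.g. a Mac Studio) repay its premium
# in saved electricity versus a cheap ~1 kW server running 24/7?
# Assumptions: $0.15/kWh, ~0.9 kW of extra draw, $9k hardware price gap.

def breakeven_years(price_gap_usd=9000, extra_kw=0.9, usd_per_kwh=0.15):
    """Years until saved electricity equals the up-front price difference."""
    extra_cost_per_year = extra_kw * 24 * 365 * usd_per_kwh   # 24/7 load
    return price_gap_usd / extra_cost_per_year

print(f"~{breakeven_years():.1f} years to break even")
```

At duty cycles below 24/7, or at cheaper power, the break-even stretches out even further, which is why "years down the line" is about right.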

40. RugnirViking ◴[] No.44638566{4}[source]
That is, in every sense of the term, their problem.

I will switch to whatever is best for me at a good price, and if that's not sustainable then I'll be fine too; I was a developer before these existed at all, and local models only help from there.

41. oblio ◴[] No.44640365{4}[source]
Ah, now I see, didn't know that was feasible in the PC world. Glad that it's becoming an option.
42. zer00eyz ◴[] No.44640912{3}[source]
> IPC is increasing, not flat.

Benchmarks going up is not IPC increasing. These are separate things.

Please look at IPC for the latest GPUs from Nvidia and the latest CPUs from AMD. IPC is flat. See Intel losing credibility with failing processors, due to power problems from clocking, because IPC is flat.

> Even an M4 MacBook Pro is substantially faster than an M1

Again, clocking. The M4 (non-Pro) and M1 are so close in IPC on common tasks that the difference is negligible. The performance gains between the two come from memory bandwidth, not core performance.

> Server chips are scaling core counts

Parallelism is not the same as performance. Intel shipping the "Core Duo" 20 years ago, RUNNING at 2GHz, was an admission that single-threading was ending. 20 years on, we're 20 cores deep (consumer) and only at 4GHz with "boost clocks" (back to that pesky power and cooling problem).

And that product still exists today: the N150 (close enough). It has lower power consumption and more cores. And what was the single-core performance gain? A 35% improvement in 20 years.

None of these things are running any of the LLMs that power the tools we're talking about. Those are in the datacenter: 700-core CPUs and 400-800Gbps top-of-rack switching are the bleeding edge. This is where power and cooling have hit the wall. The spacing requirements of a bleeding-edge NVIDIA install are impacting the costs of interconnect between systems. Lots of fiber, spaced out because of power/heat, adds up to a boatload of extra networking costs. Half-empty racks because of density are now a reality.

And you see these same issues at home: power demands of GPUs for consumers and workstations are through the roof. We're past what the PCI spec can provide; all that power is heat and has to go somewhere. Sometimes it burns up poorly designed connectors. The latest gen consumes even more power, to push clocks higher, for very little gain (see Nvidia's flat IPC).

43. viraptor ◴[] No.44641572{5}[source]
In a way... But it's still just because Anthropic lets them. Things can change at any point.
44. theshrike79 ◴[] No.44644288[source]
This is the thing. I'm waiting for the equivalent of Google Coral[0], but powerful enough for AI workloads.

You can plug in the $60 Coral to a Raspberry Pi and get real-time image recognition running in Frigate.

When I can have:

1) Something similar inside my computer/laptop

2) Something I can plug in to my computer via USB-C

3) Something I can buy and install to my LAN so all devices in my home can connect to it

I'll buy it instantly.

What I don't want is a massive generic GPU that just happens to be good at AI workloads, I want custom hardware that's more efficient and cheaper.

(Off topic, but my guess is that Apple is aiming for #3 with an Apple TV variant so you can have more power than your phone, but still keep it 100% local)

[0] https://coral.ai/products/accelerator/

45. theshrike79 ◴[] No.44644474[source]
I think we're in the early days of Bitcoin mining here.

People were buying stores empty of GPUs to mine BTC.

Then people built custom ASICs that couldn't do anything but mine BTC, but did it a lot cheaper and with a lot less electricity, so pretty much nobody GPU-mines anymore.

I'm waiting for a similar thing to happen to local AI.

46. manmal ◴[] No.44646215{3}[source]
Nothing's wrong with using autocomplete in addition to agents.
47. otabdeveloper4 ◴[] No.44646290{3}[source]
Not yet. That day will come though.
48. airspresso ◴[] No.44652673{4}[source]
Yes, for inference the main bottleneck is GPU VRAM and the bandwidth between the GPU cores and VRAM. Ideally you want enough GPU VRAM to be able to load the entire model into VRAM + have room for caching the already-produced output in VRAM when you're generating output tokens. And fast enough VRAM bandwidth that you can copy the weights from VRAM to GPU compute cores as fast as possible to do the calculations for each token. This determines the tokens/sec speed you get for the output. So yes, more and faster VRAM is essential.
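For a sense of scale, the "room for caching the already-produced output" part (the KV cache) grows linearly with context length. A sketch assuming a Llama-3-70B-like shape (80 layers, 8 KV heads via grouped-query attention, head dim 128) at fp16; those shape numbers are my assumptions for illustration:

```python
# KV-cache size: per cached token we store K and V vectors for every layer,
# i.e. 2 * layers * kv_heads * head_dim * bytes_per_element.

def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    """Approximate KV-cache footprint in GB for a given context length."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

for ctx in (8_000, 128_000):
    print(f"{ctx:>7} tokens of context -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

So long contexts add tens of GB on top of the weights themselves, and every generated token re-reads that cache too, which is why VRAM capacity and bandwidth both bound output speed.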