MCP in LM Studio

(lmstudio.ai)
240 points by yags | 79 comments
1. chisleu ◴[] No.44380098[source]
Just ordered a $12k Mac Studio w/ 512GB of unified RAM.

Can't wait for it to arrive and crank up LM Studio. It's literally the first install. I'm going to download it with Safari.

LM Studio is newish and its interface isn't perfect yet, but it's fantastic at what it does, which is bring local LLMs to the masses without them having to know much.

There is another project that people should be aware of: https://github.com/exo-explore/exo

Exo is a radically cool tool that automatically clusters all hosts on your network running Exo and pools their combined GPUs for increased throughput.

As in HPC environments, you're going to want ultra-fast interconnects, but it's all just IP-based.

replies(15): >>44380196 #>>44380217 #>>44380386 #>>44380596 #>>44380626 #>>44380956 #>>44381072 #>>44381075 #>>44381174 #>>44381177 #>>44381267 #>>44385069 #>>44386056 #>>44387384 #>>44393032 #
2. dchest ◴[] No.44380196[source]
I'm using it on MacBook Air M1 / 8 GB RAM with Qwen3-4B to generate summaries and tags for my vibe-coded Bloomberg Terminal-style RSS reader :-) It works fine (the laptop gets hot and slow, but fine).

Probably should just use llama.cpp server/ollama and not waste a gig of memory on Electron, but I like GUIs.

replies(1): >>44380381 #
3. karmakaze ◴[] No.44380217[source]
Nice. Ironically well suited for non-Apple Intelligence.
4. minimaxir ◴[] No.44380381[source]
8 GB of RAM with local LLMs is iffy in general: an 8-bit quantized Qwen3-4B is 4.2GB on disk and likely more in memory. 16 GB is usually the minimum to be able to run decent models without compromising on heavy quantization.
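Rough back-of-the-envelope math (an illustrative sketch; the parameter count and bits-per-weight are approximations, and real GGUF files carry extra overhead beyond the weights):

```python
# Rough, assumed estimate of model memory at different quantizations.
# Numbers are illustrative; real GGUF files add metadata and mixed-precision layers.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

for bits in (16, 8, 4):
    print(f"Qwen3-4B @ {bits}-bit ~= {weight_memory_gb(4.0, bits):.1f} GB "
          f"(plus KV cache and runtime overhead)")

# 8-bit -> ~4.0 GB, close to the 4.2 GB on-disk figure above; on an 8 GB machine
# that leaves little headroom once the OS and the KV cache are counted.
```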
replies(2): >>44382797 #>>44385257 #
5. incognito124 ◴[] No.44380386[source]
> I'm going to download it with Safari

Oof you were NOT joking

replies(1): >>44381086 #
6. sneak ◴[] No.44380596[source]
I already got one of these. I’m spoiled by Claude 4 Opus; local LLMs are slower and lower quality.

I haven’t been using it much. All it has on it is LM Studio, Ollama, and Stats.app.

> Can't wait for it to arrive and crank up LM Studio. It's literally the first install. I'm going to download it with Safari.

lol, yup. same.

replies(1): >>44380720 #
7. teaearlgraycold ◴[] No.44380626[source]
What are you going to do with the LLMs you run?
replies(1): >>44380685 #
8. chisleu ◴[] No.44380685[source]
Currently I'm using Gemini 2.5 and Claude 3.7 Sonnet for coding tasks.

I'm interested in using local models for code generation, but I'm not expecting much in that regard.

I'm planning to attempt fine-tuning open-source models on certain tool sets, especially MCP tools.

9. chisleu ◴[] No.44380720[source]
Yup, I'm spoiled by Claude 3.7 Sonnet right now. I had to stop using Opus for plan mode in my agent because it is just so expensive. I'm using Gemini 2.5 Pro for that now.

I'm considering ordering one of these today: https://www.newegg.com/p/N82E16816139451?Item=N82E1681613945...

It looks like it will hold 5 GPUs with a single slot open for InfiniBand.

The local models might be lower quality, but they won't be slow! :)

replies(3): >>44381101 #>>44382010 #>>44384667 #
10. prettyblocks ◴[] No.44380956[source]
I've been using Open WebUI and am pretty happy with it. Why do you like LM Studio more?
replies(3): >>44381042 #>>44381073 #>>44381909 #
11. truemotive ◴[] No.44381042[source]
Open WebUI can leverage the built-in web server in LM Studio, just FYI in case you thought it was primarily a chat interface.
12. noman-land ◴[] No.44381072[source]
I love LM Studio. It's a great tool. I'm waiting for another generation of MacBook Pros to do as you did :).
13. prophesi ◴[] No.44381073[source]
Not OP, but with LM Studio I get a chat interface out of the box for local models, while with Open WebUI I'd need to configure it to point to an OpenAI-API-compatible server (like LM Studio). LM Studio can also help determine which models will run well on your hardware.

LM Studio isn't FOSS though.

I did enjoy hooking up Open WebUI to Firefox's experimental AI Chatbot (set browser.ml.chat.hideLocalhost to false and browser.ml.chat.provider to localhost:${openwebui-port}).
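For anyone wiring up a similar setup, here's a minimal sketch of talking to LM Studio's OpenAI-compatible local server with the openai Python client; the port (1234, LM Studio's usual default) and the model handling are assumptions to adapt to whatever your local server actually exposes:

```python
# Minimal sketch: chat against a local OpenAI-compatible server
# (e.g. LM Studio's "Local Server"). Port 1234 is the usual LM Studio default.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # local endpoint, not api.openai.com
    api_key="not-needed-locally",         # local servers typically ignore the key
)

# List whatever models the local server has loaded, then chat with the first one.
models = client.models.list()
model_id = models.data[0].id

resp = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Summarize why local LLMs are useful in one sentence."}],
)
print(resp.choices[0].message.content)
```

The same client should work against any other OpenAI-compatible endpoint (Ollama, llama.cpp's server, etc.) by changing base_url.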

14. imranq ◴[] No.44381075[source]
I'd love to host my own LLMs, but I keep getting held back by the quality and affordability of cloud LLMs. Why go local unless there's private data involved?
replies(3): >>44383336 #>>44385249 #>>44388345 #
15. noman-land ◴[] No.44381086[source]
Safari to download LM Studio. LM Studio to download models. Models to download Firefox.
replies(1): >>44381629 #
16. kristopolous ◴[] No.44381101{3}[source]
The GPUs are the hard thing to find unless you want to pay something like a 50% markup.
replies(1): >>44384701 #
17. ◴[] No.44381174[source]
18. zackify ◴[] No.44381177[source]
I love LM Studio, but I'd never waste $12k like that. The memory bandwidth is too low, trust me.

Get the RTX Pro 6000 for $8.5k with double the bandwidth. It will be way better.

replies(6): >>44382823 #>>44382833 #>>44383071 #>>44386064 #>>44387179 #>>44407623 #
19. teaearlgraycold ◴[] No.44381629{3}[source]
The modern Ninite.
20. s1mplicissimus ◴[] No.44381909[source]
I recently tried Open WebUI, but it was so painful to get it running with a local model. The "first run experience" of LM Studio is pretty fire in comparison. Can't really talk about actually working with it though; still waiting for the 8GB download.
replies(1): >>44382953 #
21. evo_9 ◴[] No.44382010{3}[source]
I was using Claude 3.7 exclusively for coding, but it sure seems like it got worse suddenly about 2–3 weeks back. It went from writing pretty solid code I had to make only minor changes to, to being completely off the rails: altering files unrelated to my prompt, undoing fixes from the same conversation, reinventing db access, and ignoring the coding 'standards' established in the existing codebase. It became so untrustworthy that I finally gave OpenAI o3 a try and, honestly, I was pretty surprised how solid it has been. I've been using o3 since, and I find it generally does exactly what I ask, especially if you have a well-established project with plenty of code for it to reference.

Has Claude 3.7 seemed different lately for anyone else? It was my go-to for several months, and I'm no fan of OpenAI, but o3 has been rock solid.

replies(2): >>44383401 #>>44384695 #
22. hnuser123456 ◴[] No.44382797{3}[source]
But 8GB of Apple RAM is 16GB of normal RAM.

https://www.pcgamer.com/apple-vp-says-8gb-ram-on-a-macbook-p...

replies(2): >>44383813 #>>44383841 #
23. marci ◴[] No.44382823[source]
You can't run DeepSeek-V3/R1 on the RTX Pro 6000, not to mention the upcoming 1M-context Qwen models or the current Qwen3-235B.
replies(1): >>44404092 #
24. tymscar ◴[] No.44382833[source]
Why would they pay 2/3 of the price for something with 1/5 of the RAM?

The whole point of spending that much money, for them, is to run massive models, like the full R1, which the Pro 6000 can't.

replies(1): >>44383770 #
25. prettyblocks ◴[] No.44382953{3}[source]
Interesting. I run my local LLMs through Ollama, and it's zero trouble to get that working in Open WebUI as long as the Ollama server is running.
replies(1): >>44386320 #
26. t1amat ◴[] No.44383071[source]
(Replying to both siblings questioning this)

If the primary use case is input heavy, which is true of agentic tools, there’s a world where partial GPU offload with many channels of DDR5 system RAM leads to an overall better experience. A good GPU will process input many times faster, and with good RAM you might end up with decent output speed still. Seems like that would come in close to $12k?

And there would be no competition for models that do fit entirely inside that VRAM, for example Qwen3 32B.
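A minimal sketch of what that partial-offload setup can look like with llama-cpp-python (the model path and n_gpu_layers value are placeholders; the right layer count depends on how much VRAM you actually have):

```python
# Sketch: partial GPU offload with llama-cpp-python. The GPU handles the
# offloaded layers (and most prompt processing); the rest stays in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3-32b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=40,   # offload as many layers as fit in VRAM; -1 = all layers
    n_ctx=8192,        # context window; larger contexts cost more memory
)

out = llm.create_completion(
    "Explain the trade-off between GPU offload and system RAM in one paragraph.",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```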

27. mycall ◴[] No.44383336[source]
Offline is another use case.
replies(1): >>44383597 #
28. jessmartin ◴[] No.44383401{4}[source]
It could be that the prompt and/or tool descriptions degraded in whatever tool you're using Claude through. I've definitely noticed variance across Cursor, Claude Code, etc., even with the exact same models.

Prompts + tools matter.

replies(1): >>44385534 #
29. seanmcdirmid ◴[] No.44383597{3}[source]
Nothing like playing around with LLMs on an airplane without an internet connection.
replies(2): >>44383945 #>>44388368 #
30. zackify ◴[] No.44383770{3}[source]
Because waiting forever for initial prompt processing, with a realistic number of MCP tools enabled on a prompt, is going to suck without the most bandwidth possible.

And you are never going to sit around waiting for anything larger than the 96+ GB of VRAM that the RTX Pro has.

If you're using it for background tasks and not coding, it's a different story.

replies(6): >>44384804 #>>44385388 #>>44386018 #>>44386069 #>>44388078 #>>44407647 #
31. arrty88 ◴[] No.44383813{4}[source]
I concur. I just upgraded from an M1 Air with 8GB to an M4 with 24GB. Excited to run bigger models.
replies(1): >>44386303 #
32. minimaxir ◴[] No.44383841{4}[source]
Interestingly, it was AI (Apple Intelligence) that was the primary reason Apple abandoned that stance.
33. asteroidburger ◴[] No.44383945{4}[source]
If I can afford a seat above economy with room to actually, comfortably work on a laptop, I can afford the couple bucks for wifi for the flight.
replies(2): >>44384251 #>>44388091 #
34. seanmcdirmid ◴[] No.44384251{5}[source]
If you are assuming that your Hainan Airlines flight has wifi that isn't behind the GFW, even outside of cattle class, I have some news for you...
replies(1): >>44384457 #
35. sach1 ◴[] No.44384457{6}[source]
Getting around the GFW is trivially easy.
replies(1): >>44389173 #
36. sneak ◴[] No.44384667{3}[source]
I’m firehosing about $1k/mo at Cursor on pay-as-you-go and am happy to do it (it’s delivering $2-10k of value each month).

What cards are you gonna put in that chassis?

37. sneak ◴[] No.44384695{4}[source]
Me too. (re: Claude; I haven’t switched models.) It sucks because I was happily paying >$1k/mo in usage charges and then it all went south.
38. sneak ◴[] No.44384701{4}[source]
That’s just what they cost; MSRP is irrelevant. They’re not hard to find, they’re just expensive.
39. johndough ◴[] No.44384804{4}[source]
If the MCP tools come first in the conversation, it should be technically possible to cache the activations so you do not have to recompute them each time.
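llama.cpp's server already exposes something along these lines; here's a minimal sketch, assuming a llama-server running on localhost:8080 with a model loaded and that its cache_prompt flag behaves as documented (the prefix and prompts below are placeholders):

```python
# Sketch: reuse the KV cache for a long static prefix (system prompt + MCP tool
# descriptions) with llama.cpp's llama-server. Assumes a server on localhost:8080.
import requests

STATIC_PREFIX = (
    "System: you are a coding assistant.\n"
    "Tool definitions: ...several KB of MCP tool schemas...\n"  # identical every turn
)

def ask(question: str) -> str:
    resp = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": STATIC_PREFIX + "User: " + question + "\nAssistant:",
            "n_predict": 256,
            "cache_prompt": True,  # only the new suffix should need prompt processing
        },
        timeout=600,
    )
    return resp.json()["content"]

print(ask("List the files in the repo."))
print(ask("Now summarize README.md."))  # second call should skip re-processing the prefix
```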
40. PeterStuer ◴[] No.44385249[source]
Same. For 'sovereignty' reasons I will eventually move to local processing, but for now, in development/prototyping, the gap with hosted LLMs seems too wide.
41. dchest ◴[] No.44385257{3}[source]
It's 4-bit quantized (Q4_K_M, 2.5 GB) and still works well for this task. It's amazing. I've been running various small models on this 8 GB Air since the first LLaMA and GPT-J, and they've improved so much!

macOS virtual memory does a good job of swapping things in and out of the SSD.

42. pests ◴[] No.44385388{4}[source]
Initial prompt processing with a large static context (system prompt + tools + whatever) could technically be improved by checkpointing the model state and reusing it for future prompts. Not sure if any tools support this.
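One way to sketch the idea is with llama-cpp-python's save_state/load_state, assuming they behave as documented (the model path and prompts are placeholders, and actual prefix reuse depends on the library's prefix-matching behavior):

```python
# Sketch: evaluate the static context once, snapshot the model state, then
# restore that snapshot before each new prompt instead of re-processing the prefix.
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_ctx=8192)  # placeholder model path

static_prefix = "System prompt + tool definitions ..."      # the large fixed part
llm.eval(llm.tokenize(static_prefix.encode("utf-8")))        # process it once
checkpoint = llm.save_state()                                # snapshot KV cache + position

def complete(user_prompt: str) -> str:
    llm.load_state(checkpoint)                               # rewind to the cached prefix
    # Tokenization at the prefix/prompt boundary can differ slightly, so reuse
    # may be partial; this is a sketch of the mechanism, not a tuned implementation.
    out = llm.create_completion(static_prefix + user_prompt, max_tokens=256)
    return out["choices"][0]["text"]
```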
replies(1): >>44403891 #
43. esskay ◴[] No.44385534{5}[source]
Cursor became awful over the last few weeks, so it's likely them. No idea what they did to their prompt, but it's just been incredibly poor at most tasks regardless of which model you pick.
44. tucnak ◴[] No.44386018{4}[source]
https://docs.vllm.ai/projects/production-stack/en/latest/tut...
45. storus ◴[] No.44386056[source]
If the rumors about splitting CPU/GPU in new Macs are true, your Mac Studio will be the last one capable of running DeepSeek R1 671B Q4. It looks like Apple had an accidental winner that will go away with the end of unified RAM.
replies(1): >>44387131 #
46. storus ◴[] No.44386064[source]
The RTX Pro 6000 can't do DeepSeek R1 671B Q4; you'd need 5-6 of them, which makes it way more expensive. Moreover, the Mac Studio will do it at 150W, whereas the Pro 6000 would start at 1500W.
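Back-of-the-envelope math behind that card count (illustrative numbers only; real quant file sizes and KV-cache needs vary):

```python
# Back-of-the-envelope: how many 96 GB cards does a ~Q4 671B model need?
import math

params_b = 671          # DeepSeek R1 total parameters, in billions
bits_per_weight = 4.5   # ~Q4 quants average a bit over 4 bits/weight
weights_gb = params_b * bits_per_weight / 8          # ~377 GB of weights
overhead_gb = 40                                     # assumed KV cache + activations + buffers

cards = math.ceil((weights_gb + overhead_gb) / 96)   # RTX Pro 6000: 96 GB each
print(f"~{weights_gb:.0f} GB weights + ~{overhead_gb} GB overhead -> about {cards} x 96 GB GPUs")
# -> about 5 cards, roughly matching the 5-6 figure above before long-context headroom.
```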
replies(1): >>44386270 #
47. storus ◴[] No.44386069{4}[source]
The M3 Ultra GPU is around a 3070-3080 for initial token processing. Not great, not terrible.
48. diggan ◴[] No.44386270{3}[source]
> Moreover, the Mac Studio will do it at 150W, whereas the Pro 6000 would start at 1500W.

No, the Pro 6000 pulls 600W max; not sure where you got 1500W from, that's more than double the specification.

Besides, what are the tokens/second (or seconds/token) and prompt processing speed for running DeepSeek R1 671B Q4 on a Mac Studio? Curious about those numbers, because I have a feeling they're very far apart.

replies(1): >>44395739 #
49. diggan ◴[] No.44386303{5}[source]
> m4 with 24gb

Wow, that is probably analogous to 48GB on other systems then, if we were to ask an Apple VP?

replies(1): >>44392485 #
50. diggan ◴[] No.44386320{4}[source]
I think that's the thing: compared to the full end-to-end experience of chatting in LM Studio, just running Ollama (fiddling around with terminals) is more complicated.

Of course, for folks used to terminals, daemons and so on it makes sense from the get-go, but for others it seemingly doesn't, and it doesn't help that Ollama refuses to communicate what people should understand before trying to use it.

51. phren0logy ◴[] No.44387131[source]
I have not heard this rumor. Source?
replies(1): >>44387443 #
52. smcleod ◴[] No.44387179[source]
The RTX is nice, but it's memory-limited and requires a full desktop machine to run it in. I'd take slower inference (as long as it's not less than 15 tok/s) for more memory any day!
replies(1): >>44388281 #
53. whatevsmate ◴[] No.44387384[source]
I did this a month ago and don't regret it one bit. I had a long laundry list of ML "stuff" I wanted to play with or questions to answer. There's no world in which I'm paying by the request, or token, or whatever, for hacking on fun projects. Keeping an eye on the meter is the opposite of having fun and I have absolutely nowhere I can put a loud, hot GPU (that probably has "gamer" lighting no less) in my fam's small apartment.
replies(1): >>44407585 #
54. prophesi ◴[] No.44387443{3}[source]
I believe they're talking about the rumors by an Apple supply chain analyst, Ming-Chi Kuo.

https://www.techspot.com/news/106159-apple-m5-silicon-rumore...

replies(1): >>44388382 #
55. MangoToupe ◴[] No.44388078{4}[source]
> And you are never going to sit around waiting for anything larger than the 96+ GB of VRAM that the RTX Pro has.

Am I the only person who gives aider instructions and leaves it alone for a few hours? This doesn't seem that difficult to integrate into my workflow.

replies(1): >>44388244 #
56. MangoToupe ◴[] No.44388091{5}[source]
Woah there Mr Money, slow down with these assumptions. A computer is worth the investment. But paying a cent extra to airlines? Unacceptable.
replies(1): >>44393695 #
57. diggan ◴[] No.44388244{5}[source]
> Am I the only person who gives aider instructions and leaves it alone for a few hours?

Probably not, but in my experience, if it takes longer than 10-15 minutes it's either stuck in a loop or down the wrong rabbit hole. I don't use it for vibe coding or anything "big scope" like that, though, just more focused changes/refactors, so YMMV.

58. diggan ◴[] No.44388281{3}[source]
I'd love to see more very-large-memory Mac Studio benchmarks for prompt processing and inference. The few benchmarks I've seen either failed to take prompt processing into account, didn't share the exact weights+setup used, or showed really abysmal performance.
replies(1): >>44407670 #
59. diggan ◴[] No.44388345[source]
There are some use cases I use LLMs for where I don't care a lot about the data being private (although that's a plus), but I don't want to pay XXX€ for classifying some data, and I particularly don't want to worry about having to pay that again if I want to redo it with some changes.

Using local LLMs for this, I don't worry about the price at all; I can leave it doing three tries per "task" without tripling the cost.

It's true that there is an upfront cost, but that hump is way easier to get over than on-demand/per-token costs, at least for me.

60. diggan ◴[] No.44388368{4}[source]
Some of us don't have the most reliable ISPs or even network infrastructure, and I say that as someone who lives in Spain :) I live outside a huge metropolitan area and Vodafone fiber went down twice this year, not even counting the time the country's electricity grid was down for like 24 hours.
61. diggan ◴[] No.44388382{4}[source]
Seems Apple is waking up to the fact that if it's too easy to run weights locally, there really isn't much sense in having their own remote inference endpoints, so time to stop the party :)
replies(1): >>44393922 #
62. seanmcdirmid ◴[] No.44389173{7}[source]
ya ya, just buy a VPN, pay the yearly subscription, and then have them disappear the week after you paid. Super trivially frustrating.
replies(1): >>44392519 #
63. vntok ◴[] No.44392485{6}[source]
Not sure what Apple VPs have to do with the tech, but yeah, pretty much any core engineer you ask at Apple will tell you this.

Here is a nice article with some info about what memory compression is and how it works: https://arstechnica.com/gadgets/2013/10/os-x-10-9/#page-17

It was a hard technical problem, but it has been pretty much solved since its debut in 2012-2013.

replies(1): >>44394032 #
64. vntok ◴[] No.44392519{8}[source]
VPN providers are first and foremost trust businesses. Why would you choose and pay for one that is not well established and trusted? Mine has been around for more than a decade now.

Alternatively, you could just set up your own (cheaper?) VPN relay on the tiniest VPS you can rent on AWS or IBM Cloud, right?

replies(1): >>44393687 #
65. datpuz ◴[] No.44393032[source]
I genuinely cannot wrap my head around spending this much money on hardware that is dramatically inferior to hardware that costs half the price. macOS is not even great anymore; they stopped improving their UX like a decade ago.
replies(1): >>44407595 #
66. seanmcdirmid ◴[] No.44393687{9}[source]
The VPN providers that get you over the GFW from inside China are Chinese, and China is not yet a high-trust society; just like a gym that takes your payment for a year of fees and then disappears the next week (sigh). If AWS or IBM Cloud find out you are using them as a VPN to jump the GFW, they will ban you for life; Microsoft, IBM, and Amazon aren't interested in having their whole clouds added to the GFW block list. Many people have tried this (including Microsofties in China with free Azure credits) and they've all been dealt with harshly by the cloud providers.
67. seanmcdirmid ◴[] No.44393695{6}[source]
The $3,000 that an MBP M3 Max with 64GB of RAM costs might cover a round-trip business-class ticket on a transpacific flight… if it's on sale (probably a Chinese carrier, with GFW internet).
68. prophesi ◴[] No.44393922{5}[source]
I thought their goal was to completely remove the need for a remote inference endpoint in the first place? May have read your comment wrong.
replies(1): >>44395562 #
69. pxc ◴[] No.44394032{7}[source]
I've heard good things about how macOS handles memory relative to other operating systems. But Linux and Windows both have memory compression nowadays. So the claim is then not that memory compression makes your RAM twice as effective, but that macOS' memory compression is twice as good as the real and existing memory compression available on other operating systems.

Doesn't such a claim... need stronger evidence?

70. diggan ◴[] No.44395562{6}[source]
No, I think Apple has been clear from the beginning that they won't be able to do everything on the devices themselves; that's why they're building the infrastructure/software for their "cloud intelligence system" or whatever they call it.
71. storus ◴[] No.44395739{4}[source]
You need at least 5x Pro 6000 (for smaller contexts); say the Max-Q edition running at 300W each, so overall you get a minimum of 1500W.

You get around 6 tokens/second, which is not great but not terrible. If you use very long prompts, things get bad.

72. 112233 ◴[] No.44403891{5}[source]
Dropping in late to this discussion, but is there any way to "comfortably" use multiple precomputed KV caches with current models, in the style of this work: https://arxiv.org/abs/2212.10947 ?

Meaning, I pre-parse multiple documents, and the prompt and completion attention sees all of them, but there is no attention between the documents (they are all encoded in the same overlapping positions).

This way you can include a basically unlimited amount of data in the prompt, paying for it with performance.

73. 112233 ◴[] No.44404092{3}[source]
I can run full DeepSeek R1 on an M1 Max with 64GB of RAM: around 0.5 t/s with a small quant. A Q4 quant of Maverick (253 GB) runs at 2.3 t/s on it (no GPU offload).

Practically, a last-gen or even ES/QS EPYC or Xeon (with AMX), enough RAM to fill all 8 or 12 channels, plus fast storage (4 Gen5 NVMe drives are almost 60 GB/s) looks, on paper at least, like the cheapest way to run these huge MoE models at hobbyist speeds.

replies(1): >>44455060 #
74. chisleu ◴[] No.44407585[source]
Right on. I also have a laundry list of ML things I want to do, starting with fine-tuning models.

I don't mind paying for models to do things like code; I like to move really fast when I'm coding. But for other things, I just didn't want to spend a week or two coming up to speed on the hardware needed to build a GPU system. You can just order a big GPU box, but it's going to cost you astronomically right now. Building a system with 4-5 PCIe 5.0 x16 slots, enough power, enough PCIe lanes... it's a lot to learn. You can't go on PCPartPicker and just hunt for a motherboard with 6 double-width slots.

This is a machine to let me do some things with local models. My first goal is to run some quantized version of the new V3 model and try to use it for coding tasks.

I expect it will be slow for sure, but I just want to know what it's capable of.

75. chisleu ◴[] No.44407595[source]
How can you say something so brave, and so wrong?
76. chisleu ◴[] No.44407623[source]
Only on HN can buying a $12k badass computer be a waste of money.
77. chisleu ◴[] No.44407647{4}[source]
You are correct that inference speed per $ is not optimized with this purchase.

What is optimized is the ability to fine-tune medium-size models (~200GB) per $.

You just can't get 500GB of VRAM for less than $100k. Even with $9k Blackwell cards, you have $10k in a barebones GPU server. You can't use commodity hardware and cluster it, because you need fast interconnects; I'm talking 200-400GB/s interconnects. And those take yet another PCIe slot and require expensive InfiniBand switches.

Shit gets costly fast. I agonized over this purchase for weeks, eventually deciding that it's the easiest path to success for my purposes. Not for everyone's, but for mine.

78. chisleu ◴[] No.44407670{4}[source]
Oh I plan to produce a ton of that. I'll post a blog on it to HN and /r/localllama when I'm done.
79. marci ◴[] No.44455060{4}[source]
If you're talking about DeepSeek R1 with llama.cpp and mmap, then at this point you can run DeepSeek R1 on a Raspberry Pi Zero with a 256GB microSD card and a phone charger. The only metric left to measure is one's patience.