Most active commenters

chisleu(4)
storus(3)
diggan(3)

Popular/hot comments

>>44383770 #

←back to thread

MCP in LM Studio

(lmstudio.ai)

Show context

chisleu ◴[25 Jun 25 17:58 UTC] No.44380098[source]▶

>>44379792 (OP) #

Just ordered a $12k mac studio w/ 512GB of integrated RAM.

Can't wait for it to arrive and crank up LM Studio. It's literally the first install. I'm going to download it with safari.

LM Studio is newish, and it's not a perfect interface yet, but it's fantastic at what it does which is bring local LLMs to the masses w/o them having to know much.

There is another project that people should be aware of: https://github.com/exo-explore/exo

Exo is this radically cool tool that automatically clusters all hosts on your network running Exo and uses their combined GPUs for increased throughput.

Like HPC environments, you are going to need ultra fast interconnects, but it's just IP based.

replies(15): >>44380196 #>>44380217 #>>44380386 #>>44380596 #>>44380626 #>>44380956 #>>44381072 #>>44381075 #>>44381174 #>>44381177 #>>44381267 #>>44385069 #>>44386056 #>>44387384 #>>44393032 #

1. zackify ◴[25 Jun 25 19:44 UTC] No.44381177[source]▶

>>44380098 #

I love LM studio but I’d never waste 12k like that. The memory bandwidth is too low trust me.

Get the RTX Pro 6000 for 8.5k with double the bandwidth. It will be way better

replies(6): >>44382823 #>>44382833 #>>44383071 #>>44386064 #>>44387179 #>>44407623 #

2. marci ◴[25 Jun 25 23:38 UTC] No.44382823[source]▶

>>44381177 (TP) #

You can't run deepseek-v3/r1 on the RTX Pro 6000, not to mention the upcomming 1 million context qwen models, or the current qwen3-235b.

replies(1): >>44404092 #

3. tymscar ◴[25 Jun 25 23:41 UTC] No.44382833[source]▶

>>44381177 (TP) #

Why would they pay 2/3 of the price for something with 1/5 of ram?

The whole point of spending that much money for them is to run massive models, like the full R1, which the Pro 6000 cant

replies(1): >>44383770 #

4. t1amat ◴[26 Jun 25 00:22 UTC] No.44383071[source]▶

>>44381177 (TP) #

(Replying to both siblings questioning this)

If the primary use case is input heavy, which is true of agentic tools, there’s a world where partial GPU offload with many channels of DDR5 system RAM leads to an overall better experience. A good GPU will process input many times faster, and with good RAM you might end up with decent output speed still. Seems like that would come in close to $12k?

And there would be no competition for models that do fit entirely inside that VRAM, for example Qwen3 32B.

5. zackify ◴[26 Jun 25 02:39 UTC] No.44383770[source]▶

>>44382833 #

Because waiting forever for initial prompt processing with realistic number of MCP tools enabled on a prompt is going to suck without the most bandwidth possible

And you are never going to sit around waiting for anything larger than the 96+gb of ram that the RTX pro has.

If you’re using it for background tasks and not coding it’s a different story

replies(6): >>44384804 #>>44385388 #>>44386018 #>>44386069 #>>44388078 #>>44407647 #

6. johndough ◴[26 Jun 25 06:44 UTC] No.44384804{3}[source]▶

>>44383770 #

If the MPC tools come first in the conversation, it should be technically possible to cache the activations, so you do not have to recompute them each time.

7. pests ◴[26 Jun 25 08:31 UTC] No.44385388{3}[source]▶

>>44383770 #

Initial prompt processing with a large static context (system prompt + tools + whatever) could technically be improved by checkpointing the model state and reusing for future prompts. Not sure if any tools support this.

replies(1): >>44403891 #

8. tucnak ◴[26 Jun 25 10:35 UTC] No.44386018{3}[source]▶

>>44383770 #

https://docs.vllm.ai/projects/production-stack/en/latest/tut...

9. storus ◴[26 Jun 25 10:43 UTC] No.44386064[source]▶

>>44381177 (TP) #

RTX Pro 6000 can't do DeepSeek R1 671B Q4, you'd need 5-6 of them, which makes it way more expensive. Moreover, MacStudio will do it at 150W whereas Pro 6000 would start at 1500W.

replies(1): >>44386270 #

10. storus ◴[26 Jun 25 10:44 UTC] No.44386069{3}[source]▶

>>44383770 #

M3 Ultra GPU is around 3070-3080 for the initial token processing. Not great, not terrible.

11. diggan ◴[26 Jun 25 11:21 UTC] No.44386270[source]▶

>>44386064 #

> Moreover, MacStudio will do it at 150W whereas Pro 6000 would start at 1500W.

No, Pro 6000 pulls max 600W, not sure where you get 1500W from, that's more than double the specification.

Besides, what is the token/second or second/token, and prompt processing speed for running DeepSeek R1 671B on a Mac Studio with Q4? Curious about those numbers, because I have a feeling they're very far off each other.

replies(1): >>44395739 #

12. smcleod ◴[26 Jun 25 13:23 UTC] No.44387179[source]▶

>>44381177 (TP) #

RTX is nice, but it's memory limited and requires to have a full desktop machine to run it in. I'd take slower inference (as long as it's not less than 15tk/s) for more memory any day!

replies(1): >>44388281 #

13. MangoToupe ◴[26 Jun 25 14:56 UTC] No.44388078{3}[source]▶

>>44383770 #

> And you are never going to sit around waiting for anything larger than the 96+gb of ram that the RTX pro has.

Am I the only person that gives aider instructions and leaves it alone for a few hours? This doesn't seem that difficult to integrate into my workflow.

replies(1): >>44388244 #

14. diggan ◴[26 Jun 25 15:17 UTC] No.44388244{4}[source]▶

>>44388078 #

> Am I the only person that gives aider instructions and leaves it alone for a few hours?

Probably not, but in my experience, if it takes longer than 10-15 minutes it's either stuck in a loop or down the wrong rabbit hole. But I don't use it for vibe coding or anything "big scope" like that, but more focused changes/refactors so YMMV

15. diggan ◴[26 Jun 25 15:21 UTC] No.44388281[source]▶

>>44387179 #

I'd love to see more Very-Large-Memory Mac Studio benchmarks for prompt processing and inference. The few benchmarks I've seem either missed to take prompt processing into account, didn't share exact weights+setup that were used or showed really abysmal performance.

replies(1): >>44407670 #

16. storus ◴[27 Jun 25 10:58 UTC] No.44395739{3}[source]▶

>>44386270 #

You need at least 5x Pro 6000 (for smaller contexts), let's say Max-Q edition running at 300W, so overall you get a minimum of 1500W.

You get around 6 tokens/second which is not great but not terrible. If you use very long prompts, things get bad.

17. 112233 ◴[28 Jun 25 11:26 UTC] No.44403891{4}[source]▶

>>44385388 #

Dropping in late into this discussion, but is there any way to "comfortably" use multiple precomputed kv-caches with current models, in the style of this work: https://arxiv.org/abs/2212.10947 ?

Meaning, I pre-parse multiple documents, and the prompt and completion attention sees all of them, but there is no attention between the documents (they are all encoded in the same overlapping positions).

This way you can include basically unlimited amount of data in the prompt, paying for it with the perfomance.

18. 112233 ◴[28 Jun 25 12:08 UTC] No.44404092[source]▶

>>44382823 #

I can run full deepseek r1 on m1 max with 64GB of ram. Around 0.5 t/s with small quant. Q4 quant of Maverick (253 GB) runs at 2.3 t/s on it (no GPU offload).

Practically, last gen or even ES/QS EPYC or Xeon (with AMX), enough RAM to fill all 8 or 12 channels plus fast storage (4 Gen5 NVMEs are almost 60 GB/s) on paper at least look like cheapest way to run these huge MoE models at hobbyist speeds.

replies(1): >>44455060 #

19. chisleu ◴[28 Jun 25 19:51 UTC] No.44407623[source]▶

>>44381177 (TP) #

Only on HN can buying a $12k badass computer be a waste of money

20. chisleu ◴[28 Jun 25 19:55 UTC] No.44407647{3}[source]▶

>>44383770 #

You are correct that inference speed per $ is not optimized with this purchase.

What is optimized is the ability to find tune medium size models (~200GB) / $

You just can't get 500GB of VRAM for less than $100k. Even with $9k Blackwell cards, you have $10k in a barebones GPU server. You can't use commodity hardware and cluster it because you need fast interconnects. I'm talking 200-400GB/s interconnects. And those take yet another PCIe slot and require expensive Infiniband switches.

Shit gets costly fast. I consternated about this purchase for weeks. Eventually deciding that it's the easiest path to success for my purposes. Not for everyone's, but for mine.

21. chisleu ◴[28 Jun 25 19:58 UTC] No.44407670{3}[source]▶

>>44388281 #

Oh I plan to produce a ton of that. I'll post a blog on it to HN and /r/localllama when I'm done.

22. marci ◴[03 Jul 25 13:48 UTC] No.44455060{3}[source]▶

>>44404092 #

If you're talking about Deepseek r1 with llama.cpp and mmap, then at this point you can run deepseek r1 on a raspberry zero with a 256GB micro sdcard and a phone charger. The only metric left to know is one's patience.

↑