Most active commenters

bigyabai(3)
pdimitar(3)
esafak(3)

Popular/hot comments

>>44977904 #
>>44977707 #

←back to thread

DeepSeek-v3.1

(api-docs.deepseek.com)

Show context

esafak ◴[21 Aug 25 20:12 UTC] No.44977474[source]▶

>>44976764 (OP) #

It seems behind Qwen3 235B 2507 Reasoning (which I like) and gpt-oss-120B: https://artificialanalysis.ai/models/deepseek-v3-1-reasoning

Pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1

replies(2): >>44977550 #>>44981531 #

1. bigyabai ◴[21 Aug 25 20:18 UTC] No.44977550[source]▶

>>44977474 #

Those Qwen3 2507 models are the local creme-de-la-creme right now. If you've got any sort of GPU and ~32gb of RAM to play with, the A3B one is great for pair-programming tasks.

replies(4): >>44977707 #>>44978006 #>>44978062 #>>44979710 #

2. pdimitar ◴[21 Aug 25 20:32 UTC] No.44977707[source]▶

>>44977550 (TP) #

Do you happen to know if it can be run via an eGPU enclosure with f.ex. RTX 5090 inside, under Linux?

I'm considering buying a Linux workstation lately and I want it full AMD. But if I can just plug an NVIDIA card via an eGPU card for self-hosting LLMs then that would be amazing.

replies(3): >>44977887 #>>44977902 #>>44978104 #

3. gunalx ◴[21 Aug 25 20:48 UTC] No.44977887[source]▶

>>44977707 #

You would still need drivers and all the stuff difficult with nvidia in linux with a egpu. (Its not nessecarily terrible just suboptimal) Rather just add the second GPU in the Workstation, or just run the llm in your AMD GPU.

replies(1): >>44977904 #

4. bigyabai ◴[21 Aug 25 20:49 UTC] No.44977902[source]▶

>>44977707 #

Sure, though you'll be bottlenecked by the interconnect speed if you're tiling between system memory and the dGPU memory. That shouldn't be an issue for the 30B model, but would definitely be an issue for the 480B-sized models.

5. pdimitar ◴[21 Aug 25 20:49 UTC] No.44977904{3}[source]▶

>>44977887 #

Oh, we can run LLMs efficiently with AMD GPUs now? Pretty cool, I haven't been following, thank you.

replies(4): >>44978437 #>>44984429 #>>44984563 #>>44989107 #

6. tomr75 ◴[21 Aug 25 20:58 UTC] No.44978006[source]▶

>>44977550 (TP) #

With qwen code?

7. decide1000 ◴[21 Aug 25 21:03 UTC] No.44978062[source]▶

>>44977550 (TP) #

I use it on a 24gb gpu Tesla P40. Very happy with the result.

replies(1): >>44978305 #

8. oktoberpaard ◴[21 Aug 25 21:07 UTC] No.44978104[source]▶

>>44977707 #

I’m running Ollama on 2 eGPUs over Thunderbolt. Works well for me. You’re still dealing with an NVDIA device, of course. The connection type is not going to change that hassle.

replies(1): >>44978144 #

9. pdimitar ◴[21 Aug 25 21:10 UTC] No.44978144{3}[source]▶

>>44978104 #

Thank you for the validation. As much as I don't like NVIDIA's shenanigans on Linux, having a local LLM is very tempting and I might put my ideological problems to rest over it.

Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?

replies(2): >>44978798 #>>44980758 #

10. hkt ◴[21 Aug 25 21:28 UTC] No.44978305[source]▶

>>44978062 #

Out of interest, roughly how many tokens per second do you get on that?

replies(1): >>44978371 #

11. edude03 ◴[21 Aug 25 21:35 UTC] No.44978371{3}[source]▶

>>44978305 #

Like 4. Definitely single digit. The P40s are slow af

replies(1): >>44979987 #

12. DarkFuture ◴[21 Aug 25 21:42 UTC] No.44978437{4}[source]▶

>>44977904 #

I've been running LLM models on my Radeon 7600 XT 16GB for past 2-3 months without issues (Windows 11). I've been using llama.cpp only. The only thing from AMD I installed (apart from latest Radeon drivers) is the "AMD HIP SDK" (very straight forward installer). After unzipping (the zip from GitHub releases page must contain hip-radeon in the name) all I do is this:

llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf

And then connect to llamacpp via browser to localhost:8080 for the WebUI (its basic but does the job, screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has OpenAI-compatible API.

13. arcanemachiner ◴[21 Aug 25 22:21 UTC] No.44978798{4}[source]▶

>>44978144 #

Yes, Ollama is very plug-and-play when it comes to multi GPU.

llama.cpp probably is too, but I haven't tried it with a bigger model yet.

14. indigodaddy ◴[22 Aug 25 00:08 UTC] No.44979710[source]▶

>>44977550 (TP) #

Do we get these good qwen models when using qwen-code CLI tool and authing via qwen.ai account?

replies(2): >>44987679 #>>44989538 #

15. coolspot ◴[22 Aug 25 00:56 UTC] No.44979987{4}[source]▶

>>44978371 #

P40 has memory bandwidth of 346GB/s which means it should be able to do around 14+ t/s running a 24 GB model+context.

replies(1): >>45043333 #

16. SV_BubbleTime ◴[22 Aug 25 03:31 UTC] No.44980758{4}[source]▶

>>44978144 #

Even today some progress was released on parallelizing WAN video generation over multiple GPUs. LLMs are way easier to split up.

17. bavell ◴[22 Aug 25 13:27 UTC] No.44984429{4}[source]▶

>>44977904 #

IDK about "efficiently" but we've been able to run llms locally with AMD for 1.5-2 years now

18. Plasmoid2000ad ◴[22 Aug 25 13:39 UTC] No.44984563{4}[source]▶

>>44977904 #

Yes - I'm running a LM Studio on windows on a 6800xt, and everything works more-or-less out of the box using always using Vulkan llama.cpp on the gpu I believe.

There's also ROCm. That's not working for me in LM Studio at the moment. I used that early last year to get some LLMs and stable diffusion running. As far as I can tell, it was faster before, but Vulkan implementations have caught up or something - so much the mucking about isn't often worth it. I believe ROCm is hit or miss for a lot of people, especially on windows.

19. bigyabai ◴[22 Aug 25 18:05 UTC] No.44987679[source]▶

>>44979710 #

I'm not sure, probably?

20. green7ea ◴[22 Aug 25 19:58 UTC] No.44989107{4}[source]▶

>>44977904 #

llama.cpp and lmstudio have a Vulkan backend which is pretty fast. I'm using it to run models on a Strix Halo laptop and it works pretty well.

21. esafak ◴[22 Aug 25 20:39 UTC] No.44989538[source]▶

>>44979710 #

You do not need qwen-code or qwen.ai to use them; openrouter + opencode suffice.

replies(1): >>44989707 #

22. indigodaddy ◴[22 Aug 25 20:56 UTC] No.44989707{3}[source]▶

>>44989538 #

Right, I'm aware, was just wondering about that specific scenario.

replies(1): >>44989881 #

23. esafak ◴[22 Aug 25 21:13 UTC] No.44989881{4}[source]▶

>>44989707 #

I don't know about qwen.ai but you can use that model in qwen-cli through openrouter or Alibaba Cloud ModelStudio: https://www.alibabacloud.com/help/en/model-studio/models#42e...

24. edude03 ◴[27 Aug 25 18:38 UTC] No.45043333{5}[source]▶

>>44979987 #

Not sure why I got downvoted - literally the first result (for me) says the best result[0] is 11t/s at Q3. Everything else is single digits, like 2-8t/s. Also considering that its not supported anymore[1] (It's Compute Capability is 6.1, not supported by cuda anymore) and it's power draw, I'd highly recommend anyone interested in ML stay far away from it - even if its all you can afford.

While the memory bandwidth is decent, you do actually need to do matmuls and other compute operations for LLMs, which again its pretty slow at

[0]: https://old.reddit.com/r/LocalLLaMA/comments/1dcdit2/p40_ben... [1]: https://developer.nvidia.com/cuda-gpus

↑