
    DeepSeek-v3.1-Terminus

    (api-docs.deepseek.com)
    101 points by meetpateltech | 14 comments
    1. sbinnee No.45332653
    > What’s improved? Language consistency: fewer CN/EN mix-ups & no more random chars.

    It's good that they made this improvement. But are there any advantages at this point to using DeepSeek over Qwen?

    replies(4): >>45332751 #>>45332752 #>>45333575 #>>45336644 #
    2. IgorPartola No.45332751
    I wish there was some easy resource to keep up with the latest models. The best I have come up with so far is asking one model to research the others. Realistically I want to know latest versions, best use case, performance (in terms of speed) relative to some baseline, and hardware requirements to run it.
    replies(3): >>45333280 #>>45333716 #>>45335468 #
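
    There's no single canonical tracker, but one low-effort option is scripting against an aggregator's model catalog. A minimal sketch, assuming OpenRouter's public /api/v1/models endpoint; the field names (created, context_length) are assumptions to check against its docs:

        # Sketch: list recently added models from OpenRouter's public catalog.
        # Field names ("created", "context_length") are assumptions to verify.
        import requests

        resp = requests.get("https://openrouter.ai/api/v1/models", timeout=30)
        resp.raise_for_status()
        models = resp.json()["data"]

        # Newest first; print a quick overview of the 20 most recent models.
        for m in sorted(models, key=lambda m: m.get("created", 0), reverse=True)[:20]:
            print(m["id"], "| context:", m.get("context_length"))
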
    3. comrade1234 No.45332752
    MIT license that lets you run it on your own hardware and make money off of it.
    replies(1): >>45333537 #
    4. exe34 No.45333280
    > asking one model to research the others.

    that's basically choosing at random with extra steps!

    replies(1): >>45333527 #
    5. throwup238 No.45333527
    Research, not spitting out an answer from its weights. Just ask Gemini/Claude to do deep research on /r/LocalLLama and HN posts.
    6. coder543 No.45333537
    Qwen3 models (including their 235B and 480B models) use the Apache-2.0 license, so it’s not like that’s a big difference here.
    7. coder543 No.45333575
    They seem fairly competitive with each other. You would have to benchmark them for your specific use case.
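
    A minimal harness for that kind of head-to-head, use-case-specific benchmarking, assuming both providers expose OpenAI-compatible endpoints (DeepSeek documents one at api.deepseek.com; the Qwen base URL and both model names here are placeholders to verify):

        # Sketch: run the same prompts against two OpenAI-compatible endpoints
        # and score them with your own checks. Base URLs and model names are
        # assumptions/placeholders, not verified values.
        from openai import OpenAI

        endpoints = {
            "deepseek": (OpenAI(base_url="https://api.deepseek.com", api_key="sk-..."),
                         "deepseek-chat"),
            "qwen": (OpenAI(base_url="https://your-qwen-endpoint/v1", api_key="sk-..."),
                     "your-qwen-model"),
        }

        # Replace with prompts and pass/fail checks from your real workload.
        cases = [("Reply with only the word ok.",
                  lambda out: out.strip().lower() == "ok")]

        for name, (client, model) in endpoints.items():
            passed = 0
            for prompt, check in cases:
                msg = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                ).choices[0].message.content
                passed += bool(check(msg))
            print(f"{name}: {passed}/{len(cases)} passed")
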
    8. Jgoauh No.45333716
    Have you tried https://artificialanalysis.ai/?
    replies(2): >>45334600 #>>45348957 #
    9. JimDugan No.45334600
    A dumb collation of benchmarks that the big labs are essentially training on. Livebench.ai is the industry standard - non-contaminated, with new questions every few months.
    replies(1): >>45334853 #
    10. IgorPartola No.45334853
    Thanks! Are the scores in some way linear here? As in, if model A is rated at 25 and model B at 50, does that mean I will have half the mistakes with model B? Get answers that are 2x more accurate? Or is it subjective?
    replies(1): >>45340887 #
    11. __mharrison__ No.45335468
    I use Aider heavily and find their benchmark to be pretty good. It's updated relatively frequently (most recently a month ago, which may be an eternity in AI time).

    https://aider.chat/docs/leaderboards/

    12. twotwotwo No.45336644
    The fast Cerebras thing got me to try the Qwen3 models. I couldn't get them working all that well: they had trouble using the required output format and following instructions. On the other hand, benchmarks say they should be great, and it sounds like maybe some people use them OK via different tools.

    I'm curious if my experience was unusual (it very much could be!) and I'd be interested to hear from anyone who's used both.
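
    Format failures like that are often easier to isolate outside a full agent stack. A rough repro sketch, assuming an OpenAI-compatible endpoint (base URL and model name are placeholders): ask for strict JSON, validate, and count how many retries the model needs:

        # Sketch: measure how often a model violates a strict JSON-only format.
        # The endpoint and model name below are placeholders, not real values.
        import json
        from openai import OpenAI

        client = OpenAI(base_url="https://your-endpoint/v1", api_key="sk-...")

        def ask_json(prompt, retries=3):
            for attempt in range(retries):
                out = client.chat.completions.create(
                    model="your-model",
                    messages=[{"role": "user",
                               "content": prompt + "\nReply with valid JSON only."}],
                ).choices[0].message.content
                try:
                    return json.loads(out), attempt  # payload + retries it took
                except json.JSONDecodeError:
                    continue
            raise ValueError("no valid JSON after retries")
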

    13. esafak No.45340887
    I believe the score represents the fraction of correct answers, so yes.
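
    Spelling out the arithmetic, since "half the mistakes" and "2x more accurate" come apart: if the score is the fraction of correct answers, 25 vs. 50 means twice as many correct answers, but the error rate falls from 75% to 50%, not by half:

        # Quick illustration of what a fraction-correct score implies.
        score_a, score_b = 0.25, 0.50

        print(score_b / score_a)              # 2.0  -> B gets twice as many right
        print((1 - score_b) / (1 - score_a))  # ~0.67 -> B's error rate is 2/3 of A's
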
    14. alexeiz No.45348957
    It says the best "coding index" is held by Grok 4 and Gemini 2.5 Pro. Give me a break. Nobody uses those models for serious coding. It's dominated by Sonnet 4/Opus 4.1 and GPT-5.