
DeepSeek-v3.1

(api-docs.deepseek.com)
776 points by wertyk | 9 comments
hodgehog11 ◴[] No.44977357[source]
For reference, here is the terminal-bench leaderboard:

https://www.tbench.ai/leaderboard

Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but it still does reasonably well compared to other open-weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.

replies(6): >>44977423 #>>44977655 #>>44977754 #>>44977946 #>>44978395 #>>44978560 #
1. coliveira ◴[] No.44977655[source]
My personal experience is that it produces high quality results.
replies(2): >>44977708 #>>44980748 #
2. amrrs ◴[] No.44977708[source]
Any example or prompt you used to make this statement?
replies(2): >>44977903 #>>44979268 #
3. imachine1980_ ◴[] No.44977903[source]
I remember asking for quotes about the Spanish conquest of South America because I couldn't remember who said a specific thing. The GPT model started hallucinating quotes on the topic, while DeepSeek responded with something like, "I don't know a quote about that specific topic, but you might mean this other thing," and then cited a real quote on the same topic, after acknowledging that it couldn't find the one I had read in an old book. I don't use it for coding, but for more obscure topics I feel it's more precise.
replies(2): >>44978739 #>>44980475 #
4. mycall ◴[] No.44978739{3}[source]
I wonder if something like Conway's law is responsible for that: the training data is regional, and it carries concept biases that the model sends back in its responses.
5. sync ◴[] No.44979268[source]
I'm doing coreference resolution, and this model (w/o thinking) performs at the Gemini 2.5 Pro level (w/ thinking_budget set to -1, i.e. dynamic thinking) at a fraction of the cost.
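
For the curious, here's a minimal sketch of what that side-by-side might look like. It assumes DeepSeek's OpenAI-compatible endpoint and Google's google-genai SDK; the sample prompt and env-var names are illustrative, not the actual test setup:

    # Compare DeepSeek (non-thinking) vs Gemini 2.5 Pro (dynamic thinking)
    # on a Winograd-style coreference question. Assumes API keys in env vars;
    # the prompt below is an illustrative stand-in, not the real dataset.
    import os
    from openai import OpenAI      # pip install openai
    from google import genai       # pip install google-genai
    from google.genai import types

    PROMPT = (
        "Resolve the coreference: in 'The trophy didn't fit in the suitcase "
        "because it was too big', what does 'it' refer to? Answer in one word."
    )

    # DeepSeek exposes an OpenAI-compatible API; 'deepseek-chat' is the
    # non-thinking mode of V3.1.
    deepseek = OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",
    )
    ds_answer = deepseek.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": PROMPT}],
    ).choices[0].message.content

    # Gemini 2.5 Pro with thinking_budget=-1, i.e. the model decides how much
    # to think (the setting mentioned above).
    gemini = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    gm_answer = gemini.models.generate_content(
        model="gemini-2.5-pro",
        contents=PROMPT,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=-1)
        ),
    ).text

    print("DeepSeek :", ds_answer)
    print("Gemini   :", gm_answer)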
replies(2): >>44979646 #>>44983231 #
6. dr_dshiv ◴[] No.44979646{3}[source]
Strong claim there!
7. valtism ◴[] No.44980475{3}[source]
Was that with GPT-5? They claim it's much better at not hallucinating.
8. SV_BubbleTime ◴[] No.44980748[source]
Vibes are about the only benchmark I think is real.

We made objective systems that turn out subjective answers… why the shit would anyone think objective tests could grade them?

9. antman ◴[] No.44983231{3}[source]
Nice point. How did you test coreference resolution? A specific prompt or a dataset?