(api-docs.deepseek.com)

776 points wertyk | 3 comments | 21 Aug 25 19:06 UTC

hodgehog11 ◴[21 Aug 25 20:01 UTC] No.44977357[source]▶

For reference, here is the terminal-bench leaderboard:

https://www.tbench.ai/leaderboard

Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but it still does reasonably well compared to other open-weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.

replies(6): >>44977423 #>>44977655 #>>44977754 #>>44977946 #>>44978395 #>>44978560 #

1. segmondy ◴[21 Aug 25 21:37 UTC] No.44978395[source]▶

>>44977357 #

Garbage benchmark: an inconsistent mix of "agent tools" and models. If you wanted to present a meaningful benchmark, the agent tools would stay the same so that we could really compare the models.

There are plenty of other benchmarks that disagree with these. That said, in my experience most of these benchmarks are trash. Use the model yourself, apply your own set of problems, and see how well it fares.
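
To make the "keep the agent tools the same" point concrete, here is a minimal sketch of a like-for-like comparison: group leaderboard rows by agent and only rank models inside the same agent. The function name `compare_within_agent` and the rows are hypothetical placeholders, not real leaderboard data or any benchmark's actual API.

```python
from collections import defaultdict


def compare_within_agent(results):
    """Rank models only against other models run with the same agent tool.

    `results` is an iterable of (agent, model, score) tuples.
    """
    by_agent = defaultdict(list)
    for agent, model, score in results:
        by_agent[agent].append((model, score))

    # An agent paired with a single model gives no like-for-like comparison,
    # so it is dropped rather than mixed in with other agents.
    return {
        agent: sorted(rows, key=lambda row: row[1], reverse=True)
        for agent, rows in by_agent.items()
        if len(rows) >= 2
    }


# Placeholder rows for illustration only; these are NOT real scores.
rows = [
    ("agent_a", "model_x", 0.40),
    ("agent_a", "model_y", 0.55),
    ("agent_b", "model_x", 0.30),
]
ranked = compare_within_agent(rows)
```

With these made-up rows, only `agent_a` survives the filter, since `agent_b` was run with a single model and says nothing about how models compare.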

replies(2): >>44981147 #>>44988389 #

2. paradite ◴[22 Aug 25 05:00 UTC] No.44981147[source]▶

>>44978395 (TP) #

Hey. I like your roast of benchmarks.

I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by humans using rubrics). Would love for you to check them out and share your thoughts:

Example recent one on GPT-5:

https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...

All results:

https://eval.16x.engineer/evals/coding

3. jstummbillig ◴[22 Aug 25 18:56 UTC] No.44988389[source]▶

>>44978395 (TP) #

Which benchmarks are not garbage?

I don't consider myself super special. I think it should be doable to create a benchmark that beats me having to test every single new model.

DeepSeek-v3.1