
DeepSeek-v3.1

(api-docs.deepseek.com)
776 points by wertyk | 1 comment
hodgehog11 ◴[] No.44977357[source]
For reference, here is the terminal-bench leaderboard:

https://www.tbench.ai/leaderboard

Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.

replies(6): >>44977423 #>>44977655 #>>44977754 #>>44977946 #>>44978395 #>>44978560 #
guluarte ◴[] No.44977946[source]
tbh companies like Anthropic and OpenAI create custom agents for specific benchmarks
replies(2): >>44978101 #>>44979380 #
bazmattaz ◴[] No.44978101[source]
Do you have a source for this? I’m intrigued
replies(1): >>44978244 #
guluarte ◴[] No.44978244{3}[source]
https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5a...

"we iteratively refine prompting by analyzing failure cases and developing prompts to address them."
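The quoted process amounts to a failure-driven refinement loop: run an eval suite, inspect the failing cases, and extend the prompt to address them. A minimal sketch of that loop, with entirely hypothetical names (`run_eval`, `FIXES`, the toy cases) standing in for a real model API and benchmark harness:

```python
def run_eval(prompt, cases):
    """Stub evaluator: a case 'passes' if the prompt covers its topic.
    A real harness would run the agent against each benchmark task."""
    return [c for c in cases if c["topic"] not in prompt]  # failing cases

# Hypothetical prompt patches, one per known failure category.
FIXES = {
    "paths": "Always use absolute paths.",
    "retries": "On transient errors, use retries before giving up.",
}

def refine(prompt, cases, max_rounds=5):
    """Iteratively analyze failure cases and extend the prompt."""
    for _ in range(max_rounds):
        failures = run_eval(prompt, cases)
        if not failures:
            break
        for f in failures:
            prompt += " " + FIXES[f["topic"]]
    return prompt

cases = [{"topic": "paths"}, {"topic": "retries"}]
final = refine("You are a terminal agent.", cases)
```

This is a sketch of the general technique, not Anthropic's actual pipeline; the point is only that the refined prompt is benchmark-specific, which is the commenter's objection.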