
DeepSeek-v3.1

(api-docs.deepseek.com)
776 points by wertyk | 1 comment
hodgehog11 ◴[] No.44977357[source]
For reference, here is the terminal-bench leaderboard:

https://www.tbench.ai/leaderboard

Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.

replies(6): >>44977423 #>>44977655 #>>44977754 #>>44977946 #>>44978395 #>>44978560 #
guluarte ◴[] No.44977946[source]
tbh companies like Anthropic and OpenAI create custom agents for specific benchmarks
replies(2): >>44978101 #>>44979380 #
bazmattaz ◴[] No.44978101[source]
Do you have a source for this? I’m intrigued
replies(1): >>44978244 #
guluarte ◴[] No.44978244{3}[source]
https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5a...

"we iteratively refine prompting by analyzing failure cases and developing prompts to address them."
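The quoted process amounts to a failure-driven refinement loop: run an eval suite, inspect the failing cases, and extend the prompt to address them. A minimal sketch of that loop, with entirely hypothetical names (`run_eval`, `FIXES`, the toy cases) standing in for a real model API and benchmark harness:

```python
def run_eval(prompt, cases):
    """Stub evaluator: a case 'passes' if the prompt covers its topic.
    A real harness would run the agent against each benchmark task."""
    return [c for c in cases if c["topic"] not in prompt]  # failing cases

# Hypothetical prompt patches, one per known failure category.
FIXES = {
    "paths": "Always use absolute paths.",
    "retries": "On transient errors, use retries before giving up.",
}

def refine(prompt, cases, max_rounds=5):
    """Iteratively analyze failure cases and extend the prompt."""
    for _ in range(max_rounds):
        failures = run_eval(prompt, cases)
        if not failures:
            break
        for f in failures:
            prompt += " " + FIXES[f["topic"]]
    return prompt

cases = [{"topic": "paths"}, {"topic": "retries"}]
final = refine("You are a terminal agent.", cases)
```

This is a sketch of the general technique, not Anthropic's actual pipeline; the point is only that the refined prompt is benchmark-specific, which is the commenter's objection.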