https://artificialanalysis.ai/models/deepseek-v3-1-reasoning
Comparing https://openrouter.ai/openai/gpt-oss-120b and https://openrouter.ai/deepseek/deepseek-chat-v3.1 across the same providers is probably better, although gpt-oss-120b has been around long enough to have more providers, and presumably long enough for hosts to get comfortable with it and optimize their hosting of it.
Its knowledge seems to be lacking even compared to GPT-3.
No idea how you'd benchmark this though.
Let's hope not, because gpt-oss-120B can be dramatically moronic. I am guessing the MoE contains some very dumb experts.
Benchmarks can be a starting point, but you really have to see how the results work for you.
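For that kind of hands-on check, something like the sketch below works: run your own prompts through an OpenAI-compatible endpoint (OpenRouter here, using the two model slugs from the links above) and compare the answers side by side. This is a minimal sketch, not a recipe; the prompt list, the max_tokens value, and the OPENROUTER_API_KEY environment variable are illustrative assumptions.

```python
# Minimal sketch: compare answers from two OpenRouter-hosted models on your own prompts.
# Assumes the official `openai` Python client and an OPENROUTER_API_KEY env var.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter exposes an OpenAI-compatible API
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Model slugs from the links above; swap in whatever you are evaluating.
MODELS = ["openai/gpt-oss-120b", "deepseek/deepseek-chat-v3.1"]

# Illustrative prompts only -- use questions from your own domain.
PROMPTS = [
    "Who wrote 'The Master and Margarita', and in what decade?",
    "What does the Linux setns syscall do?",
]

for prompt in PROMPTS:
    print(f"\n=== {prompt}")
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200,
        )
        print(f"\n[{model}]\n{resp.choices[0].message.content}")
```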
Qwen3-Coder-480B-A35B-Instruct
GLM4.5 Air
Kimi K2
DeepSeek V3 0324 / R1 0528
GPT-5 Mini
Thanks for any feedback!
That is the point of these small models: strip out the bloat of obscure information (address that with RAG) and leave behind a core "reasoning" skeleton.
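One way to picture that split, as a minimal sketch: keep the obscure facts in an external store, retrieve the relevant ones at query time, and hand them to the small model as context so it only has to reason over them. The fact list and the word-overlap scoring below are toy assumptions; a real setup would use embeddings and a vector store.

```python
# Toy RAG sketch: obscure facts live outside the model and only the relevant ones
# are injected into the prompt, so the small model reasons rather than memorizes.

FACTS = [
    "The Voyager 1 probe crossed the heliopause in August 2012.",
    "OCaml's garbage collector uses a minor heap and a major heap.",
    "The Treaty of Tordesillas was signed in 1494.",
]

def retrieve(query: str, facts: list[str], k: int = 2) -> list[str]:
    """Rank facts by naive word overlap with the query and return the top k."""
    q_words = set(query.lower().split())
    scored = sorted(facts, key=lambda f: len(q_words & set(f.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(question: str) -> str:
    """Assemble a prompt that hands the model the retrieved facts as context."""
    context = "\n".join(f"- {fact}" for fact in retrieve(question, FACTS))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("When did Voyager 1 leave the heliosphere?"))
```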