
82 points | meetpateltech | 1 comment
RayVR ◴[] No.45311094[source]
A faster model that outperforms its slower version on multiple benchmarks? Can anyone explain why that makes sense? Are they simply retraining on the benchmark tests?
replies(4): >>45311127 #>>45311184 #>>45311402 #>>45311754 #
1. yorwba ◴[] No.45311402[source]
It doesn't outperform uniformly across benchmarks. It's worse than Grok 4 on GPQA Diamond and HLE (Humanity's Last Exam) without tools, both of which require the model to have memorized a large number of facts. Large (and thus slow) models typically do better on these.

The other benchmarks focus on reasoning and tool use, so the model doesn't need to have memorized quite so many facts; it just needs to be able to transform them from one representation to another (e.g. a user question into a search tool call, or a list of search results into a concise answer). Larger models should in theory also be better at that, but you need to train them for those specific tasks first.
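
To make that concrete, here's a minimal sketch of the kind of tool-use loop those benchmarks exercise. The call_model and web_search helpers are hypothetical stand-ins, not any real API:

    # Hypothetical stand-ins: swap in a real LLM client and search backend.
    def call_model(prompt: str) -> str:
        raise NotImplementedError("plug in an LLM API here")

    def web_search(query: str) -> list[str]:
        raise NotImplementedError("plug in a search API here")

    def answer_with_tools(question: str) -> str:
        # Representation 1 -> 2: user question becomes a search tool call.
        query = call_model(f"Rewrite this as a search query: {question}")
        # Representation 2 -> 3: raw search results become a concise answer.
        snippets = "\n".join(web_search(query))
        return call_model(
            f"Using only these snippets:\n{snippets}\n"
            f"Answer concisely: {question}"
        )

At no point does the model need the underlying facts in its weights; the facts arrive via the tool, and the model only has to reformat them at each step.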

So I don't think they simply trained on the benchmark tests; rather, they shifted their training mix to emphasize particular tasks, and in the announcement they highlight the benchmarks that test those tasks and where their model performs better.
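
"Shifting the training mix" can be as simple as changing the sampling weights over task datasets. A toy sketch; the dataset names and weights here are entirely made up:

    import random

    # Illustrative only: upweight reasoning/tool-use data relative to
    # fact-recall data. Dataset names and weights are invented.
    training_mix = {
        "fact_recall_qa":   0.10,  # de-emphasized
        "tool_use_traces":  0.50,  # emphasized
        "reasoning_chains": 0.40,  # emphasized
    }

    def sample_task() -> str:
        # Draw the next training example's task according to the mix weights.
        tasks, weights = zip(*training_mix.items())
        return random.choices(tasks, weights=weights, k=1)[0]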

You could also write an anti-announcement by picking a few more fact-recall benchmarks and highlighting that it does worse on those. (I assume.)