82 points | meetpateltech | 5 comments

1. RayVR No.45311094
A faster model that outperforms its slower version on multiple benchmarks? Can anyone explain why that makes sense? Are they simply retraining on the benchmark tests?
replies(4): >>45311127 >>45311184 >>45311402 >>45311754
2. NitpickLawyer No.45311127
> Can anyone explain why that makes sense?

It could be anything: a different architecture, more data, RL, etc. It's probably RL. In recent months the top-tier labs seem to have "cracked" RL post-training to a degree not yet seen in open models, and by a large margin.
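
For anyone unfamiliar with what "RL" means here: sample outputs, score them with a verifiable reward, and push up the probability of the high-reward ones. A toy REINFORCE sketch (everything is illustrative; the frontier labs' actual recipes aren't public):

    import torch

    # Tiny stand-in policy: logits over a 3-token "vocabulary".
    logits = torch.zeros(3, requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=0.1)

    def reward(token: int) -> float:
        # Verifiable reward: pretend token 2 is the correct answer.
        return 1.0 if token == 2 else 0.0

    for step in range(200):
        dist = torch.distributions.Categorical(logits=logits)
        samples = dist.sample((16,))
        rewards = torch.tensor([reward(t.item()) for t in samples])
        baseline = rewards.mean()  # simple variance reduction
        # REINFORCE: maximize E[(R - baseline) * log pi(a)]
        loss = -((rewards - baseline) * dist.log_prob(samples)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(torch.softmax(logits, dim=-1))  # mass concentrates on token 2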

3. raincole No.45311184
Just two different models branded under similar names. That's it. Grok 4 is not the slower version of Grok 4 Fast, just like GPT-4 is not the slower version of GPT-4o.
4. yorwba No.45311402
It doesn't outperform uniformly across benchmarks. It's worse than Grok 4 on GPQA Diamond and HLE (Humanity's Last Exam) without tools, both of which require the model to have memorized a large number of facts. Large (and thus slow) models typically do better on these.

The other benchmarks focus on reasoning and tool use, so the model doesn't need to have memorized quite so many facts; it just needs to be able to transform them from one representation to another (e.g. user question to search tool call; list of search results to concise answer; see the sketch below). Larger models should in theory also be better at that, but you need to train them for those specific tasks first.
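
To make the transformation concrete, it can be as thin as this loop (the model_* functions are hypothetical stand-ins for LLM calls; a real harness is far more involved):

    from typing import List

    def model_write_query(question: str) -> str:
        # Stand-in for the LLM rewriting a question as a search query.
        return question.rstrip("?")

    def search(query: str) -> List[str]:
        # Stand-in for the search tool; returns snippets.
        return [f"snippet about {query}"]

    def model_summarize(question: str, snippets: List[str]) -> str:
        # Stand-in for the LLM condensing snippets into an answer.
        return f"From {len(snippets)} result(s): {snippets[0]}"

    def answer(question: str) -> str:
        query = model_write_query(question)         # question -> tool call
        results = search(query)                     # execute the tool
        return model_summarize(question, results)   # results -> answer

The facts live in the search index rather than in the weights, which is why task-specific training can matter more than raw size here.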

So I don't think they simply trained on the benchmark tests; rather, they shifted their training mix to emphasize particular tasks, and the announcement highlights the benchmarks that test exactly those tasks, where the new model does better.

You could also write an anti-announcement by picking a few more fact-recall benchmarks and highlighting that it does worse at those. (I assume.)

5. uyzstvqs No.45311754
Grok 4 Fast is likely Grok 4 distilled down to shed weights that rarely, if ever, get activated in production. Then you'd expect exactly these results: it's the same logic copied from the big model, just more focused.
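
If so, the standard (Hinton-style) distillation loss would look roughly like this toy sketch (assumed, not confirmed: xAI hasn't said how Grok 4 Fast was trained):

    import torch
    import torch.nn.functional as F

    teacher = torch.nn.Linear(16, 10)  # stand-in for the big model
    student = torch.nn.Linear(16, 10)  # stand-in for the small model
    opt = torch.optim.Adam(student.parameters(), lr=1e-2)
    T = 2.0  # temperature softens the teacher's distribution

    for _ in range(100):
        x = torch.randn(32, 16)  # unlabeled inputs are enough
        with torch.no_grad():
            t_logp = F.log_softmax(teacher(x) / T, dim=-1)
        s_logp = F.log_softmax(student(x) / T, dim=-1)
        # Student matches the teacher's soft targets; the T^2 factor
        # keeps gradient magnitudes comparable across temperatures.
        loss = F.kl_div(s_logp, t_logp, log_target=True,
                        reduction="batchmean") * T * T
        opt.zero_grad()
        loss.backward()
        opt.step()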