A faster model that outperforms its slower version on multiple benchmarks? Can anyone explain why that makes sense? Are they simply retraining on the benchmark tests?
Grok 4 Fast is likely Grok 4 distilled down to remove noise that rarely, if ever, gets activated in production. If so, you'd expect these results: it's essentially the same logic copied from the big model, just more focused.
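For anyone unfamiliar with what "distilled" means here, this is a minimal sketch of the classic soft-target distillation loss, where a small student model is trained to match a large teacher's output distribution. To be clear, this is only an illustration of the general technique: the function names, the logits, and the temperature value are all made up for the example, and nothing is known from this thread about how Grok 4 Fast was actually trained.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over three candidate tokens (illustrative values only).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The student that agrees with the teacher gets a much smaller loss, so training pushes the small model toward the big model's behavior on whatever inputs it is shown during distillation.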