I think it’s just that the base model is good at real-world coding tasks, as opposed to the types of coding tasks in the common benchmarks.
If you use GitHub Copilot, which has its own system-level prompts, you can hot-swap between models, and Claude outperforms OpenAI’s and Google’s models by such a large margin that the others are functionally useless in comparison.
replies(4):