Model capability improvements are very uneven. Changes between one model and the next tend to benefit certain areas substantially without moving the needle on others. You see this across all frontier labs’ model releases. Also the version numbering is BS (remember GPT-4.5 followed by GPT-4.1?).