That ARC-AGI score is a little suspicious. That's a really tough benchmark for AI. Curious if there were improvements to the test harness, because that's a wild jump in general problem-solving ability for an incremental update.
They're clearly building better training datasets and doing extensive RL on these benchmarks over time. The out-of-distribution performance is still awful.