And this only suggested that LLMs aren't trained well to write formal math proofs, which is true.
How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned on the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
o1 screwing up a trivially easy variation: https://xcancel.com/colin_fraser/status/1864787124320387202
Claude 3.7, utterly incoherent: https://xcancel.com/colin_fraser/status/1898158943962271876
DeepSeek: https://xcancel.com/colin_fraser/status/1882510886163943443#...
The overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses specifically (even though the result looks terrible and unphysical, and always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" hit the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since they were a well-known failure, but (as always) this is just whack-a-mole; it does not generalize into an understanding of the physical principle.