And this only suggested LLMs aren't trained well to write formal math proofs, which is true.
o1 screwing up a trivially easy variation: https://xcancel.com/colin_fraser/status/1864787124320387202
Claude 3.7, utterly incoherent: https://xcancel.com/colin_fraser/status/1898158943962271876
DeepSeek: https://xcancel.com/colin_fraser/status/1882510886163943443#...
Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, and always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole; it does not generalize into understanding the physical principle.
A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?
Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"
In everyday language, suppose I say, "There is only room for one person or one animal in my car to go home." A listener would assume I am referring to additional room besides that occupied by the driver. The problem arises when we try to use an LLM trained on everyday language to solve puzzles in formal logic or math. I think current LLMs are not able to take on a specialized context and become a logical reasoning agent, but perhaps such a thing would be possible if the evaluation function of the LLM were designed to give high credit to changing context with a phrase or token.
Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning as it is as evidence of breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.