    579 points by paulpauper | 16 comments
    InkCanon ◴[] No.43604503[source]
    The biggest story in AI broke a few weeks ago but got little attention: on the recent USAMO, SOTA models scored around 5% on average (IIRC, it was some abysmal number), despite supposedly having gotten 50%, 60%, etc. on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from the training data.
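
    For reference, a minimal sketch of the kind of decontamination step labs describe in papers (dropping training documents that share a long n-gram with a benchmark question); the function names and the 13-token window here are illustrative assumptions, not anyone's actual pipeline:

        # Minimal sketch of n-gram-based decontamination: drop any training
        # document that shares a long-enough n-gram with a benchmark question.
        # The 13-token window and whitespace tokenizer are illustrative choices.

        def ngrams(text, n=13):
            tokens = text.lower().split()
            return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

        def build_benchmark_index(benchmark_questions, n=13):
            index = set()
            for q in benchmark_questions:
                index |= ngrams(q, n)
            return index

        def is_contaminated(document, benchmark_index, n=13):
            return not ngrams(document, n).isdisjoint(benchmark_index)

        # Usage: filter the training corpus before training.
        # clean_docs = [d for d in corpus if not is_contaminated(d, index)]
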
    replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #
    usaar333 ◴[] No.43605147[source]
    And then within a week, Gemini 2.5 was tested and got 25%. The point is that AI is getting stronger.

    And this only suggests that LLMs aren't trained well to write formal math proofs, which is true.

    replies(2): >>43607028 #>>43609276 #
    1. selcuka ◴[] No.43607028[source]
    > within a week

    How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.

    replies(2): >>43607092 #>>43614328 #
    2. levocardia ◴[] No.43607092[source]
    They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...
    replies(5): >>43607320 #>>43607428 #>>43607553 #>>43608063 #>>43610236 #
    3. AIPedant ◴[] No.43607320[source]
    The "ferrying animals across a river" problem has definitely not been solved; the models still don't understand the problem at all, and they overcomplicate it because they reach for an off-the-shelf solution instead of actually reasoning:

    o1 screwing up a trivially easy variation: https://xcancel.com/colin_fraser/status/1864787124320387202

    Claude 3.7, utterly incoherent: https://xcancel.com/colin_fraser/status/1898158943962271876

    DeepSeek: https://xcancel.com/colin_fraser/status/1882510886163943443#...

    The overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses specifically (even though it looks terrible and unphysical, and always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHF'd wine glasses since they were a well-known failure, but (as always) this is just whack-a-mole; it does not generalize into understanding the physical principle.

    replies(1): >>43607773 #
    4. akoboldfrying ◴[] No.43607428[source]
    > one particular nonstandard eval

    A particular nonstandard eval that is currently the top comment on this HN thread, precisely because, unlike every other eval out there, LLMs score badly on it?

    Doesn't seem implausible to me at all. If I were running that team, I'd be saying, "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"

    replies(1): >>43607620 #
    5. cma ◴[] No.43607553[source]
    They could have RLHF'd or fine-tuned on user thumbs-up responses, which could include users who took the test and asked the model to explain the problems afterwards.
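
    Purely as a hypothetical illustration of how that kind of leakage could happen without anyone deliberately targeting the benchmark, a feedback-driven fine-tuning pass might be little more than a filter over logged conversations (all field names below are invented):

        # Hypothetical sketch: selecting thumbs-up conversations for a later
        # fine-tuning pass. If users pasted USAMO problems and upvoted good
        # explanations, those problems end up in the fine-tuning set without
        # anyone targeting the benchmark.

        def select_finetune_examples(logged_conversations):
            examples = []
            for convo in logged_conversations:  # field names are invented
                if convo.get("user_rating") == "thumbs_up":
                    examples.append({
                        "prompt": convo["user_message"],
                        "completion": convo["assistant_message"],
                    })
            return examples
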
    6. og_kalu ◴[] No.43607620{3}[source]
    It's implausible that fine-tuning a premier model would have anywhere near that turnaround time. Even if they wanted to and had no qualms about doing so, it's not happening anywhere near that fast.
    replies(1): >>43609303 #
    7. leonidasv ◴[] No.43607773{3}[source]
    Gemini 2.5 Pro got the farmer problem variation right: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
    replies(2): >>43608001 #>>43609491 #
    8. greenmartian ◴[] No.43608001{4}[source]
    When told there is "only room for one person OR one animal", it's also the only one to recognise that the puzzle is impossible to solve: the farmer can't take any animals with them, and neither the goat nor the wolf can row the boat.
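
    That impossibility is easy to check mechanically. A small brute-force search, assuming just a farmer, wolf, and goat and the stated one-occupant boat (so the farmer always crosses alone), finds no safe sequence of crossings that gets everyone across:

        # Brute-force search of the river-crossing variant where the boat holds
        # only one occupant, so the farmer can never ferry an animal.
        # State: frozenset of who is on the left bank (start: everyone).
        from collections import deque

        ACTORS = {"farmer", "wolf", "goat"}

        def safe(bank):
            # The goat can't be left alone with the wolf on either bank.
            for side in (bank, ACTORS - bank):
                if "farmer" not in side and {"wolf", "goat"} <= side:
                    return False
            return True

        def solvable():
            start = frozenset(ACTORS)
            seen, queue = {start}, deque([start])
            while queue:
                left = queue.popleft()
                if not left:  # everyone has reached the right bank
                    return True
                # Only legal move: the farmer crosses alone (boat fits one
                # occupant, and the animals can't row).
                nxt = left - {"farmer"} if "farmer" in left else left | {"farmer"}
                if safe(nxt) and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
            return False

        print(solvable())  # False: the animals can never move, so it's impossible
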
    replies(1): >>43609226 #
    9. 112233 ◴[] No.43608063[source]
    Imagine that you are building a problem-solving AI. You have a large budget, and access to compute and web-crawling infrastructure to run your AI "on the internet". You would like to be aware of the ways people are currently evaluating AI, so that you can be sure your product looks good. Can you think of how one might do that?
    10. yyy3ww2 ◴[] No.43609226{5}[source]
    > When told, "only room for one person OR one animal"

    In common terms, suppose I say: "there is only room for one person or one animal in my car to go home." One could take that to mean additional room besides the seat occupied by the driver. That's the problem when we try to use an LLM trained on the everyday use of language to solve puzzles in formal logic or math. I don't think the current LLMs are able to adopt a specialized context and become logical reasoning agents, but perhaps such a thing would be possible if the LLM's evaluation function were designed to give high credit to switching context in response to a phrase or token.

    11. suddenlybananas ◴[] No.43609303{4}[source]
    It's really not that implausible; they're probably adding stuff to the data soup all the time and have a system in place for it.
    replies(1): >>43610058 #
    12. Tepix ◴[] No.43609491{4}[source]
    That can't be viewed without logging into Google first.
    13. og_kalu ◴[] No.43610058{5}[source]
    Yeah, it is, lol. You don't just train your model on whatever you like when you're expected to serve it; there are a host of problems with doing that. The idea that they trained on this obscure benchmark, which was released right around the same time, is actually very silly.
    14. NiloCK ◴[] No.43610236[source]
    I'm not generally inclined toward the "they are cheating cheaters" mindset, but I'll point out that fine-tuning is not the same as retraining. It can be done cheaply and quickly.

    Models getting 5x better at things all the time is at least as easily interpreted as evidence of task-specific tuning as it is of breakthroughs in general ability, especially when the "things being improved on" are published evals with a history.

    replies(1): >>43610319 #
    15. alphabetting ◴[] No.43610319{3}[source]
    The Google team said it was outside the training window, FWIW.

    https://x.com/jack_w_rae/status/1907454713563426883

    16. bakkoting ◴[] No.43614328[source]
    New models suddenly doing much better isn't really surprising, especially on this sort of test: going from 98% to 99% per-step accuracy can easily be the difference between one fatal reasoning error and zero fatal reasoning errors on a problem with 50 reasoning steps, and a proof with no fatal errors gets ~full credit whereas a proof with one fatal error gets ~no credit.
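
    A rough back-of-the-envelope calculation of that threshold effect, using the 50-step proof and per-step accuracies above as assumed numbers:

        # If every one of 50 reasoning steps must be correct for the proof to
        # earn credit, a small gain in per-step accuracy moves the needle a lot.
        steps = 50
        for per_step_accuracy in (0.98, 0.99):
            p_flawless = per_step_accuracy ** steps
            print(f"{per_step_accuracy:.0%} per step -> "
                  f"{p_flawless:.1%} chance of a flawless {steps}-step proof")
        # 98% per step -> ~36.4% chance; 99% per step -> ~60.5% chance
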

    And to be clear, that's pretty much all this was: there are six problems; it got almost-full credit on one, half credit on another, and bombed the rest, whereas all the other models bombed all the problems.