
579 points paulpauper | 1 comments
InkCanon ◴[] No.43604503[source]
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.
usaar333 ◴[] No.43605147[source]
And then within a week, Gemini 2.5 was tested and got 25%. Point is AI is getting stronger.

And this only suggested LLMs aren't trained well to write formal math proofs, which is true.

selcuka ◴[] No.43607028[source]
> within a week

How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.

1. bakkoting ◴[] No.43614328[source]
New models suddenly doing much better isn't really surprising, especially for this sort of test. Going from 98% to 99% per-step accuracy can easily be the difference between having 1 fatal reasoning error and having 0 fatal reasoning errors on a problem with 50 reasoning steps, and a proof with 0 fatal reasoning errors gets ~full credit whereas a proof with 1 fatal reasoning error gets ~no credit.

And to be clear, that's pretty much all this was: there are six problems, it got almost-full credit on one and half credit on another and bombed the rest, whereas all the other models bombed all the problems.
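
The 98% vs. 99% point can be checked with a rough sketch. It assumes each reasoning step succeeds independently with the same probability, which is a simplification and not something the test results establish. The function names `p_flawless` and `expected_errors` are made up for illustration.

```python
# Rough model of the argument above. Assumption: each of the 50 steps
# is correct independently with the same probability (a simplification).

def p_flawless(step_accuracy: float, steps: int) -> float:
    """Probability that a proof contains zero fatal errors."""
    return step_accuracy ** steps


def expected_errors(step_accuracy: float, steps: int) -> float:
    """Average number of fatal errors per proof."""
    return (1.0 - step_accuracy) * steps


STEPS = 50  # the step count used in the comment above

for accuracy in (0.98, 0.99):
    print(
        f"step accuracy {accuracy:.2f}: "
        f"expected errors {expected_errors(accuracy, STEPS):.2f}, "
        f"P(no fatal error) {p_flawless(accuracy, STEPS):.3f}"
    )
```

Under these assumptions the expected number of fatal errors drops from about 1 to about 0.5, and the chance of a flawless 50-step proof rises from roughly 36% to roughly 61%. With near all-or-nothing grading, a one-point gain in per-step accuracy turns into a much larger jump in score.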