Recent AI model progress feels mostly like bullshit

(www.lesswrong.com)

579 points paulpauper | 1 comments | 06 Apr 25 18:01 UTC

InkCanon ◴[06 Apr 25 20:03 UTC] No.43604503[source]▶

The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.

replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #

usaar333 ◴[06 Apr 25 21:34 UTC] No.43605147[source]▶

>>43604503 #

And then within a week, Gemini 2.5 was tested and got 25%. Point is AI is getting stronger.

And this only suggested LLMs aren't trained well to write formal math proofs, which is true.

replies(2): >>43607028 #>>43609276 #

selcuka ◴[07 Apr 25 02:33 UTC] No.43607028[source]▶

>>43605147 #

> within a week

How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.

replies(2): >>43607092 #>>43614328 #

levocardia ◴[07 Apr 25 02:44 UTC] No.43607092[source]▶

>>43607028 #

They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...

replies(5): >>43607320 #>>43607428 #>>43607553 #>>43608063 #>>43610236 #

cma ◴[07 Apr 25 04:05 UTC] No.43607553[source]▶

>>43607092 #

They could have RLHF'd or fine-tuned on user thumbs-up responses, which could include users who took the test and asked it to explain the problems afterward.
