
579 points paulpauper | 1 comment | source
InkCanon ◴[] No.43604503[source]
The biggest story in AI came out a few weeks ago but got little attention: on the recent USAMO, SOTA models scored about 5% on average (IIRC, it was some similarly abysmal number). This is despite them supposedly getting 50%, 60%, etc. on IMO questions. That strongly suggests these models are remembering past results rather than actually solving the problems. I'm surprised no one mentions this, and it's ridiculous that these companies never tell us what efforts, if any, were made to remove test data (IMO, ICPC, etc.) from the training data (a sketch of what such a check could look like follows below).
replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #
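
For concreteness, a minimal sketch of what such a decontamination check could look like: a plain word-level n-gram overlap filter, in the spirit of the 13-gram overlap check described in the GPT-3 paper. The function names and parameters below are illustrative assumptions, not any lab's actual pipeline.

    # Illustrative sketch (assumed, not any lab's real pipeline): flag
    # training documents that share any word n-gram with a benchmark
    # problem, then drop them from the corpus before training.
    def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
        """Lowercased word n-grams; 13 matches the GPT-3 paper's overlap check."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def is_contaminated(train_doc: str, test_problems: list[str], n: int = 13) -> bool:
        """True if the document shares at least one n-gram with any test problem."""
        doc_grams = ngrams(train_doc, n)
        return any(doc_grams & ngrams(problem, n) for problem in test_problems)

    # Hypothetical usage: corpus and usamo_problems are lists of strings.
    # clean_corpus = [d for d in corpus if not is_contaminated(d, usamo_problems)]

Exact-match filters like this miss paraphrases, translations, and LaTeX-vs-plain-text renderings of the same problem, which is part of why they give only weak guarantees against contamination.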
sanxiyn ◴[] No.43606419[source]
Nope, no LLM has reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO you're thinking of AlphaProof, but AlphaProof is not an LLM. We don't have the full paper yet, but AlphaProof is clearly a system built on top of an LLM with lots of bells and whistles, just like AlphaFold is.
replies(1): >>43607419 #
InkCanon ◴[] No.43607419[source]
o1 reportedly got 83% on IMO, and 89th percentile on Codeforces.

https://openai.com/index/learning-to-reason-with-llms/

The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.

replies(2): >>43607571 #>>43608872 #
sanxiyn ◴[] No.43608872[source]
AIME is so not IMO.