ARC AGI v2: 17.6% -> 52.9%
SWE Verified: 76.3% -> 80%
That's pretty good!
Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.
To see how impactful proper practice can be, look no further than the hodgepodge of independent teams that somehow keep up with SotA while running cheaper models (and, no doubt, thousands of their own puzzles, many of which surely overlap with the private set).
The benchmark isn’t particularly robust against gaming, especially with private data.
A better analogy is: someone who's never taken the AIME might think "there are an infinite number of math problems", but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems. That's not to take away from the AIME, which is quite difficult -- but not infinite.
Similarly, ARC-AGI is much more bounded than they seem to think. It correlates with intelligence, but doesn't imply it.
IMO/AIME problems perhaps, but surely that's too narrow a view for all of mathematics. If solving conjectures were simply a matter of trying a standard range of techniques enough times, there would be far fewer open problems than there are.