←back to thread

GPT-5.2

(openai.com)
1019 points atgctg | 1 comments | | HN request time: 0s | source
Show context
josalhor ◴[] No.46235005[source]
From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #
verdverm ◴[] No.46235062[source]
We're also in benchmark saturation territory. I heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well on the day-to-day
replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #
HDThoreaun ◴[] No.46235492[source]
Arc-AGI is just an iq test. I don’t see the problem with training it to be good at iq tests because that’s a skill that translates well.
replies(3): >>46236017 #>>46236535 #>>46236978 #
CamperBob2 ◴[] No.46236017{3}[source]
Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.

Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.

replies(5): >>46236205 #>>46236247 #>>46236865 #>>46237072 #>>46237171 #
1. ACCount37 ◴[] No.46237171{4}[source]
With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.

Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.

Let's assume "all metrics are perfect" for now. Then, when you score people by "chess performance"? You wouldn't see the people with the highest intelligence ever at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.

Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to the progressive matrix test? usable for measuring human IQ perhaps?), but no metric is perfect - and ARC-AGI is biased heavily towards spatial reasoning specifically.