←back to thread

GPT-5.2

(openai.com)
1019 points atgctg | 4 comments | | HN request time: 0.001s | source
Show context
josalhor ◴[] No.46235005[source]
From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #
verdverm ◴[] No.46235062[source]
We're also in benchmark saturation territory. I heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well on the day-to-day
replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #
HDThoreaun ◴[] No.46235492[source]
Arc-AGI is just an iq test. I don’t see the problem with training it to be good at iq tests because that’s a skill that translates well.
replies(3): >>46236017 #>>46236535 #>>46236978 #
CamperBob2 ◴[] No.46236017{3}[source]
Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.

Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.

replies(5): >>46236205 #>>46236247 #>>46236865 #>>46237072 #>>46237171 #
npinsker ◴[] No.46236205{4}[source]
Completely false. This is like saying being good at chess is equivalent to being smart.

Look no farther than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.

The benchmark isn’t particularly strong against gaming, especially with private data.

replies(2): >>46236598 #>>46236995 #
1. CamperBob2 ◴[] No.46236598{5}[source]
Completely false. This is like saying being good at chess is equivalent to being smart.

No, it isn't. Go take the test yourself and you'll understand how wrong that is. Arc-AGI is intentionally unlike any other benchmark.

replies(1): >>46237004 #
2. fwip ◴[] No.46237004[source]
Took a couple just now. It seems like a straight-forward generalization of the IQ tests I've taken before, reformatted into an explicit grid to be a little bit friendlier to machines.

Not to humble-brag, but I also outperform on IQ tests well beyond my actual intelligence, because "find the pattern" is fun for me and I'm relatively good at visual-spatial logic. I don't find their ability to measure 'intelligence' very compelling.

replies(1): >>46237079 #
3. CamperBob2 ◴[] No.46237079[source]
Given your intellectual resources -- which you've successfully used to pass a test that is designed to be easy for humans to pass while tripping up AI models -- why not use them to suggest a better test? The people who came up with Arc-AGI were not actually morons, but I'm sure there's room for improvement.

What would be an example of a test for machine intelligence that you would accept? I've already suggested one (namely, making up more of these sorts of tests) but it'd be good to get some additional opinions.

replies(1): >>46237199 #
4. fwip ◴[] No.46237199{3}[source]
Dunno :) I'm not an expert at LLMs or test design, I just see a lot of similarity between IQ tests and these questions.