GPT-5.2

(openai.com)

1019 points atgctg | 4 comments | 11 Dec 25 18:04 UTC

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

josalhor ◴[11 Dec 25 18:24 UTC] No.46235005[source]▶

>>46234788 (OP) #

From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!
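
For a rough sense of scale, the two jumps can be put on a common footing. This is only a sketch of the arithmetic on the figures quoted above; the function name `gains` and the choice of three measures are illustrative assumptions, not anything taken from the announcement.

```python
# Scores quoted above (percent), GPT 5.1 Thinking -> GPT-5.2.
scores = {
    "ARC AGI v2": (17.6, 52.9),
    "SWE Verified": (76.3, 80.0),
}


def gains(before, after):
    """Return (point gain, relative gain in %, cut in failure rate in %)."""
    point_gain = after - before
    # Gain relative to the old score.
    relative_gain = 100.0 * point_gain / before
    # Share of the previously failed tasks that are now solved.
    failure_cut = 100.0 * point_gain / (100.0 - before)
    return point_gain, relative_gain, failure_cut


for name, (before, after) in scores.items():
    point_gain, relative_gain, failure_cut = gains(before, after)
    print(
        f"{name}: +{point_gain:.1f} points, "
        f"{relative_gain:.0f}% relative gain, "
        f"{failure_cut:.0f}% fewer failures"
    )
```

Measured against the remaining failures, the SWE Verified move looks larger than its raw point gain suggests, though it is still much smaller than the ARC AGI v2 jump.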

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #

verdverm ◴[11 Dec 25 18:28 UTC] No.46235062[source]▶

>>46235005 #

We're also in benchmark saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because, internally, it doesn't care about them nearly as much as about making a model that works well day-to-day.

replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #

HDThoreaun ◴[11 Dec 25 18:56 UTC] No.46235492[source]▶

>>46235062 #

Arc-AGI is just an IQ test. I don’t see the problem with training it to be good at IQ tests because that’s a skill that translates well.

replies(3): >>46236017 #>>46236535 #>>46236978 #

CamperBob2 ◴[11 Dec 25 19:29 UTC] No.46236017{3}[source]▶

>>46235492 #

Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.

Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.

replies(5): >>46236205 #>>46236247 #>>46236865 #>>46237072 #>>46237171 #

npinsker ◴[11 Dec 25 19:46 UTC] No.46236205{4}[source]▶

>>46236017 #

Completely false. This is like saying being good at chess is equivalent to being smart.

Look no further than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.

The benchmark isn’t particularly strong against gaming, especially with private data.

replies(2): >>46236598 #>>46236995 #

1. CamperBob2 ◴[11 Dec 25 20:19 UTC] No.46236598{5}[source]▶

>>46236205 #

> Completely false. This is like saying being good at chess is equivalent to being smart.

No, it isn't. Go take the test yourself and you'll understand how wrong that is. Arc-AGI is intentionally unlike any other benchmark.

replies(1): >>46237004 #

2. fwip ◴[11 Dec 25 20:54 UTC] No.46237004[source]▶

>>46236598 (TP) #

Took a couple just now. It seems like a straightforward generalization of the IQ tests I've taken before, reformatted into an explicit grid to be a little bit friendlier to machines.

Not to humble-brag, but I also outperform on IQ tests well beyond my actual intelligence, because "find the pattern" is fun for me and I'm relatively good at visual-spatial logic. I don't find their ability to measure 'intelligence' very compelling.

replies(1): >>46237079 #

3. CamperBob2 ◴[11 Dec 25 21:00 UTC] No.46237079[source]▶

>>46237004 #

Given your intellectual resources -- which you've successfully used to pass a test that is designed to be easy for humans to pass while tripping up AI models -- why not use them to suggest a better test? The people who came up with Arc-AGI were not actually morons, but I'm sure there's room for improvement.

What would be an example of a test for machine intelligence that you would accept? I've already suggested one (namely, making up more of these sorts of tests) but it'd be good to get some additional opinions.

replies(1): >>46237199 #

4. fwip ◴[11 Dec 25 21:09 UTC] No.46237199{3}[source]▶

>>46237079 #

Dunno :) I'm not an expert at LLMs or test design, I just see a lot of similarity between IQ tests and these questions.
