
GPT-5.2

(openai.com)
1019 points atgctg | 18 comments
josalhor ◴[] No.46235005[source]
Compared to GPT-5.1 Thinking:

ARC-AGI-2: 17.6% -> 52.9%

SWE-bench Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #
verdverm ◴[] No.46235062[source]
We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because, internally, it cares far less about them than about making a model that works well day-to-day.
replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #
HDThoreaun ◴[] No.46235492[source]
Arc-AGI is just an IQ test. I don't see the problem with training a model to be good at IQ tests, because that's a skill that translates well.
replies(3): >>46236017 #>>46236535 #>>46236978 #
1. CamperBob2 ◴[] No.46236017{3}[source]
Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.

Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.

replies(5): >>46236205 #>>46236247 #>>46236865 #>>46237072 #>>46237171 #
2. npinsker ◴[] No.46236205[source]
Completely false. This is like saying being good at chess is equivalent to being smart.

To see how impactful proper practice can be, look no further than the hodgepodge of independent teams that keep up with SotA using cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set).

The benchmark isn't particularly resistant to gaming, even with a private evaluation set.

replies(2): >>46236598 #>>46236995 #
3. esafak ◴[] No.46236247[source]
I would not be so sure. You can always prep for the test.
replies(1): >>46236590 #
4. HDThoreaun ◴[] No.46236590[source]
How do you prep for ARC-AGI? If the answer is just "get really good at pattern recognition," I don't see that as a negative at all.
replies(1): >>46238125 #
5. CamperBob2 ◴[] No.46236598[source]
> Completely false. This is like saying being good at chess is equivalent to being smart.

No, it isn't. Go take the test yourself and you'll understand how wrong that is. Arc-AGI is intentionally unlike any other benchmark.

replies(1): >>46237004 #
6. jimbokun ◴[] No.46236865[source]
Is it different every time? Otherwise the training could just memorize the answers.
replies(1): >>46236877 #
7. CamperBob2 ◴[] No.46236877[source]
The models never have access to the answers for the private set -- again, at least in principle. Whether that's actually true, I have no idea.

The idea behind Arc-AGI is that you can train all you want on the public problems and their answers, because knowing the solution to one problem isn't helpful for the others.

In fact, the way the test works is that the model is given several examples of worked solutions for each problem class, and is then required to infer the underlying rule(s) needed to solve a different instance of the same type of problem.

That's why comparing Arc-AGI to chess or other benchmaxxing exercises is completely off base.

(IMO, an even better test for AGI would be "Make up some original Arc-AGI problems.")

8. mrandish ◴[] No.46236995[source]
ARC-AGI was designed specifically to evaluate deeper reasoning in LLMs, including being resistant to LLMs 'training to the test'. If you read François Chollet's papers, he's well aware of the challenge and has done valuable work toward this goal.
replies(1): >>46237068 #
9. fwip ◴[] No.46237004{3}[source]
Took a couple just now. It seems like a straightforward generalization of the IQ tests I've taken before, reformatted into an explicit grid to be a little friendlier to machines.

Not to humble-brag, but I also score well beyond my actual intelligence on IQ tests, because "find the pattern" is fun for me and I'm relatively good at visual-spatial logic. I don't find their ability to measure 'intelligence' very compelling.

replies(1): >>46237079 #
10. npinsker ◴[] No.46237068{3}[source]
I agree with you that it's valuable work. I just totally disagree with their claim.

A better analogy: someone who's never taken the AIME might think "there are an infinite number of math problems", but in actuality there is a relatively small, enumerable set of techniques that gets used repeatedly on virtually all problems. That's not to take away from the AIME, which is quite difficult -- but not infinite.

Similarly, ARC-AGI is much more bounded than they seem to think. It correlates with intelligence, but doesn't imply it.

replies(2): >>46239338 #>>46239562 #
11. FergusArgyll ◴[] No.46237072[source]
It's very much a vision test. The only reason the models don't all pass it easily is the vision component. It doesn't have much to do with reasoning at all.
12. CamperBob2 ◴[] No.46237079{4}[source]
Given your intellectual resources -- which you've successfully used to pass a test designed to be easy for humans while tripping up AI models -- why not use them to suggest a better test? The people who came up with Arc-AGI were not actually morons, but I'm sure there's room for improvement.

What would be an example of a test for machine intelligence that you would accept? I've already suggested one (namely, making up more of these sorts of tests) but it'd be good to get some additional opinions.

replies(1): >>46237199 #
13. ACCount37 ◴[] No.46237171[source]
With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.

Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.

Let's assume "all metrics are perfect" for now. Then, when you rank people by chess performance, you wouldn't see the people with the very highest intelligence at the top. You'd get people with pretty high intelligence but extremely, hilariously strong chess-specific skills. The tails came apart.

Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to Raven's Progressive Matrices? usable for measuring human IQ, perhaps?), but no metric is perfect -- and ARC-AGI is biased heavily towards spatial reasoning specifically.

14. fwip ◴[] No.46237199{5}[source]
Dunno :) I'm not an expert at LLMs or test design, I just see a lot of similarity between IQ tests and these questions.
15. ben_w ◴[] No.46238125{3}[source]
It can be not-negative without being sufficient.

Imagine that pattern recognition is 10% of the problem, and we just don't know what the other 90% is yet.

The streetlight effect applied to "what is intelligence" leads to all the things that LLMs are now demonstrably good at… and yet the LLMs are somehow still missing a lot, and we have to keep inventing new streetlights to search under: https://en.wikipedia.org/wiki/Streetlight_effect

replies(1): >>46238638 #
16. HDThoreaun ◴[] No.46238638{4}[source]
I don't think many people are saying a 100% score on ARC-AGI-2 is equivalent to AGI (names are dumb as usual). It's just the best metric I have found, not the final answer. Spatial reasoning is an important part of intelligence even if it doesn't encompass all of it.
17. keeda ◴[] No.46239338{4}[source]
Maybe I'm misinterpreting your point, but it sounds like your standard for "intelligence" is "inventing entirely new techniques"? If so, that's a bit extreme, because to a first approximation, all problem solving is combining and applying existing techniques in novel ways to new situations.

At the point that you are inventing entirely new techniques, you are usually doing groundbreaking work. Even groundbreaking work in one field is often inspired by techniques from other fields. In the limit, discovering truly new techniques often requires discovering new principles of reality to exploit, i.e. research.

As you can imagine, this is very difficult and hence rather uncommon, typically accomplished by only a handful of people in any given discipline, i.e. way above the standards of the general population.

I feel like if we are holding AI to those standards, we are talking about not just AGI, but artificial super-intelligence.

18. yovaer ◴[] No.46239562{4}[source]
> but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems

IMO/AIME problems, perhaps, but surely that's too narrow a view for all of mathematics. If solving conjectures were simply a matter of trying a standard range of techniques enough times, there would be far fewer open problems than there actually are.