188 points by gkamradt | 48 comments
1. gkamradt ◴[] No.43465162[source]
Hey HN, Greg from ARC Prize Foundation here.

Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.

In Dec ‘24, ARC-AGI-1 (launched in 2019) pinpointed the moment AI moved beyond pure memorization, as demonstrated by OpenAI's o3.

ARC-AGI-2 targets test-time reasoning.

My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.

Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are at <4%.

Every ARC-AGI-2 task (100%), however, has been solved quickly and easily by at least two humans. We know this because we tested 400 people live.

Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.

Change log from ARC-AGI-1 to ARC-AGI-2:

* The two main evaluation sets (semi-private eval, private eval) have increased to 120 tasks
* Solving tasks requires more reasoning vs. pure intuition
* Each task has been confirmed to have been solved by at least 2 people (many more in most cases) out of an average of 7 test takers, in 2 attempts or less
* Non-training task sets are now difficulty-calibrated
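For readers who want the "2 attempts or less" criterion in concrete terms: ARC-style grading requires an exact output-grid match, so the criterion corresponds to a pass@2 rule. A minimal sketch of that rule (function names here are illustrative, not the official harness):

```python
from typing import List

Grid = List[List[int]]  # an ARC grid: rows of integer color codes

def attempt_correct(predicted: Grid, target: Grid) -> bool:
    # Exact match required: same dimensions, same cell values.
    return predicted == target

def task_solved(attempts: List[Grid], target: Grid, max_attempts: int = 2) -> bool:
    # A task counts as solved if any of the first `max_attempts`
    # predictions reproduces the target grid exactly.
    return any(attempt_correct(a, target) for a in attempts[:max_attempts])
```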

The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) had 1.5K teams participate and produced 40+ published research papers.

The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition

We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.

Happy to answer questions.

replies(13): >>43465254 #>>43466394 #>>43466647 #>>43467579 #>>43467810 #>>43468015 #>>43468067 #>>43468081 #>>43468268 #>>43468318 #>>43468455 #>>43468706 #>>43468931 #
2. artninja1988 ◴[] No.43465254[source]
What are you doing to prevent the test set from being leaked? Will you still be offering API access to the semi-private test set to the big model providers who presumably train on their API?
replies(1): >>43465360 #
3. gkamradt ◴[] No.43465360[source]
We have a few sets:

1. Public Train - 1,000 tasks that are public
2. Public Eval - 120 tasks that are public

So for those two we don't have protections.

3. Semi-Private Eval - 120 tasks that are exposed to 3rd parties. We sign data agreements where we can, but we understand this is exposed and not 100% secure. It's a risk we are open to in order to keep testing velocity. In theory it is very difficult to secure this 100%. The cost to create a new semi-private test set is lower than the effort needed to secure it 100%.

4. Private Eval - Only on Kaggle, not exposed to any 3rd parties at all. Very few people have access to this. Our trust vectors are with Kaggle and the internal team only.
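For reference, tasks in the public sets ship as simple JSON files, with demonstration pairs under "train" and held-out pairs under "test". A minimal loading sketch, assuming that layout (the file path is illustrative):

```python
import json

# Each public task file holds demonstration pairs ("train") and held-out
# pairs ("test"); every grid is a list of rows of integer color codes.
# The path below is illustrative, not a specific task.
with open("data/training/some_task.json") as f:
    task = json.load(f)

for pair in task["train"]:
    print("input: ", pair["input"])
    print("output:", pair["output"])

# A solver must predict the output grid for each test input.
test_inputs = [pair["input"] for pair in task["test"]]
```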

replies(1): >>43466602 #
4. gmkhf ◴[] No.43466394[source]
I think a lot of people got discouraged seeing how OpenAI solved ARC-AGI-1 by what seems like brute-forcing and throwing money at it. Do you believe ARC was solved in the "spirit" of the challenge? Also, all the open-sourced solutions seem super specific to solving ARC. Is this really leading us to human-level AI on open-ended tasks?
replies(1): >>43466786 #
5. zamadatix ◴[] No.43466602{3}[source]
What prevents everything in 4 from becoming part of 3 the first time the test set is run on a proprietary model? Do you require that competitors like OpenAI provide models Kaggle can self-host for the test?
replies(1): >>43466626 #
6. gkamradt ◴[] No.43466626{4}[source]
#4 (private test set) doesn't get used for any public model testing. It is only used on the Kaggle leaderboard where no internet access is allowed.
replies(1): >>43466669 #
7. synapsomorphy ◴[] No.43466647[source]
Thanks for your awesome work Greg!

The success of o3 directly contradicts us being in an "idea-constrained environment". What makes you believe that?

replies(2): >>43468917 #>>43469112 #
8. zamadatix ◴[] No.43466669{5}[source]
Sorry, I probably phrased the question poorly. My question is more along the lines of: "When you already scored e.g. OpenAI's o3 on ARC-AGI-2, how did you guarantee OpenAI can't just look at its server logs to see question set 4?"
replies(1): >>43466703 #
9. gkamradt ◴[] No.43466703{6}[source]
Ah yes, two things:

1. We had a no-data-retention agreement with them. We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing

2. We only tested o3 against the semi-private set. We didn't test it with the private eval.

replies(3): >>43466704 #>>43467095 #>>43467449 #
10. zamadatix ◴[] No.43466704{7}[source]
Makes sense, particularly part 2 until "the final results" are needed. Thanks for taking the time to answer my question!
11. jmtulloss ◴[] No.43466786[source]
Why is this the same comment as https://news.ycombinator.com/item?id=43466406?
12. YeGoblynQueenne ◴[] No.43467095{7}[source]
>> We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing

Yuri Geller assured us he was bending the spoons with his mind. Somehow it was only when the Amazing Randi was present that Yuri Geller couldn't bend the spoons with his mind.

replies(1): >>43467979 #
13. QuadmasterXLII ◴[] No.43467449{7}[source]
Are you aware that OpenAI brazenly lied and went back on its word about its corporate structure, board governance, and for-profit status, and nonetheless of the opinion that your data-sharing agreement is different and less likely to be ignored? Or are you at step zero, where you aren’t considering malfeasance as a possibility at all?
14. vessenes ◴[] No.43467579[source]
Just want to say I really love these new problems - feels like some general intelligence went into conceiving of and creating these puzzles: we just did a few over dinner as a family.

You have my wheels turning on how to get computers better at these. Looking forward to seeing the first computer tech that can get 30-50% on these!

15. az226 ◴[] No.43467810[source]
Did any single individual solve all problems? How many such individuals were there?
16. levocardia ◴[] No.43467979{8}[source]
Ironically, "I have a magic AI test but nobody is allowed to use it" is a lot closer to the Yuri Geller situation. Tests are meant to be taken; that should be clear. And...maybe this does not apply in the academic domain, but to some extent if you cheat on an AI test "you're only cheating yourself."
replies(1): >>43468143 #
17. levocardia ◴[] No.43468015[source]
I'm really pleased to see this! The original ARC-AGI-1 paper still informs how I think about "what is intelligence" today. I was thrilled to see AI models make real progress on that test precisely when we had the next big idea (reasoning). Here's to hoping round 2 falls to a similarly big breakthrough!
18. tananaev ◴[] No.43468067[source]
Did I read this right that only 2 humans out of 400 solved the problems?
replies(2): >>43468176 #>>43469493 #
19. doctorpangloss ◴[] No.43468081[source]
Why doesn’t every blogpost contain an example of a question you ask?
20. Jensson ◴[] No.43468143{9}[source]
> but to some extent if you cheat on an AI test "you're only cheating yourself."

You cheat investors.

replies(1): >>43468660 #
21. trott ◴[] No.43468176[source]
They started with N >= 120x3 tasks, and gave each task to 4-9 humans. Then they kept only those 120x3 tasks that at least 2 humans had solved.
replies(1): >>43468272 #
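A minimal sketch of the retention rule trott describes above, assuming per-tester pass/fail records (all names here are hypothetical):

```python
def keep_calibrated_tasks(results: dict, min_solvers: int = 2) -> list:
    # `results` maps task_id -> one bool per tester: True if that tester
    # solved the task within two attempts. Keep only tasks that at least
    # `min_solvers` testers solved.
    return [task_id for task_id, passes in results.items()
            if sum(passes) >= min_solvers]

# A task seen by 7 testers and solved by 3 of them is kept; one solved
# by nobody is dropped:
print(keep_calibrated_tasks({
    "t1": [True, False, True, False, True, False, False],
    "t2": [False] * 7,
}))  # -> ['t1']
```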
22. Chathamization ◴[] No.43468268[source]
> Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI.

I don’t think that follows. Just because people fail to create ARC-AGI problems that are difficult for an AI to solve doesn’t mean that said AI can just be plugged into a humanoid robot and will now reliably cook dinner, order a pizza and drive to pick it up, take a bus downtown to busk on the street and bring the money back home, etc.

ARC-AGI is an interesting benchmark, but it’s extremely presumptive to think that these types of tests are going to demonstrate AGI.

replies(5): >>43468308 #>>43468359 #>>43468692 #>>43468875 #>>43471812 #
23. tananaev ◴[] No.43468272{3}[source]
That's a very small sample size per task. I wonder what the result would be if they gave the whole data set to an average human. I tried some simple tasks and they were doable, but I couldn't figure out the hard ones.
24. Palmik ◴[] No.43468308[source]
In your example you already indicated two tasks that you think might be hard for AI but easy for humans.

Who said that cooking dinner couldn't be part of ARC-AGI-<N>?

replies(1): >>43468338 #
25. Centigonal ◴[] No.43468318[source]
Thank you for including cost (or really any proxy for efficiency) as a dimension to this prize!
26. Chathamization ◴[] No.43468338{3}[source]
That’s precisely what I meant in my comment by “these types of tests.” People are eventually going to have some sort of standard for what they consider AGI. But that doesn’t mean the current benchmarks are useful for this task at all, and saying that the benchmarks could be completely different in the future only underscores this.
replies(1): >>43468361 #
27. yorwba ◴[] No.43468359[source]
The point isn't demonstrating AGI, but rather demonstrating that AGI definitely hasn't been reached yet.
28. pillefitz ◴[] No.43468361{4}[source]
They are useful for reaching ARC-AGI-<N+1>
replies(1): >>43468411 #
29. Chathamization ◴[] No.43468411{5}[source]
How are any of these a useful path to asking an AI to cook dinner?

We already know many tasks that most humans can do relatively easily, yet most people don’t expect AI to be able to do them for years to come (for instance, L5 self-driving). ARC-AGI appears to be going in the opposite direction - can these models pass tests that are difficult for the average person to pass.

These benchmarks are interesting in that they show increasing capabilities of the models. But they seem to be far less useful at determining AGI than the simple benchmarks we’ve had all along (can these models do everyday tasks that a human can do?).

replies(2): >>43469091 #>>43476270 #
30. az226 ◴[] No.43468455[source]
Which puzzles had the lowest solve rate? I did the first 10 and they all felt easy (mentally solved in 10-20 seconds for the easier ones and 30-60 seconds for the harder ones). I’d like to try the most difficult ones.
31. anshumankmr ◴[] No.43468660{10}[source]
And end users and developers and the general public too...

But here is the thing: even if it is rote memorization, why couldn't GPT-4o perform just as well on ARC-AGI-1? Or did the "reasoning" help in some way?

32. ◴[] No.43468692[source]
33. Nuzzerino ◴[] No.43468706[source]
Why wasn’t the ICOM framework (D. Kelley) allowed to make a scoring submission after they claimed to have beaten the scores? Are you concerned that may appear to contradict your mission statement and alienate the AGI community?
34. jononor ◴[] No.43468875[source]
The tasks you mention require intelligence but also a robot body with a lot of physical dexterity, suited to a world designed for humanoids. That seems like an additional requirement on top of intelligence? Maybe we do not want an AGI definition to include that?

There are humans who cannot perform these tasks, at least without assistive/adapted systems such as a wheelchair and accessible bus.

replies(2): >>43469358 #>>43469425 #
35. littlestymaar ◴[] No.43468917[source]
What makes you think so?

From ChatGPT 3.5 to o1, all LLM progress came from investment in training: either by using much more data, or by using higher-quality data thanks to artificial data.

o1 (and then o3) broke this paradigm by applying a novel idea (RL + search on CoT), and it's because of this that it was able to make progress on ARC-AGI.

So IMO the success of o3 supports the argument that we are in an idea-constrained environment.

replies(1): >>43469981 #
36. ustad ◴[] No.43468931[source]
Using AGI in the titles of your tests might not be accurate or appropriate. May I suggest NAI - Narrow AI?
replies(1): >>43469023 #
37. JFingleton ◴[] No.43469023[source]
My prediction: we'll be arguing about what AGI actually is... Forever.
replies(1): >>43469094 #
38. fastball ◴[] No.43469091{6}[source]
The "everyday tasks" you specifically mention involve motor skills that are not useful for measuring intelligence.
39. throwuxiytayq ◴[] No.43469094{3}[source]
Or depending on your outlook, for a couple of years, and then we will no longer be participating in these or any other cognitive exercises.
40. jononor ◴[] No.43469112[source]
Not Greg/team, so unrelated opinion: the o3 solution for ARC v1 was incredibly expensive. Some good ideas are at least needed to take that cost down by a factor of 100-10000x.
replies(1): >>43469964 #
41. aziaziazi ◴[] No.43469358{3}[source]
I read that as “humans can perform these tasks, at least with…”

Put the computer in a wheelchair of its choice and let it try to catch the bus. How else would you compare program and human reasoning abilities while disregarding the human's ability to interact with the outside world?

Edit: ARC-AGI itself is only approachable by humans with sight and manual dexterity; others need assistive devices.

42. Chathamization ◴[] No.43469425{3}[source]
> at least without assistive/adapted systems such as a wheelchair and accessible bus.

Which is precisely what the robotic body I mentioned would be.

You're talking about humans who have the mental capacity to do these things, but who don't control a body capable of doing them. That's the exact opposite of an AI that controls a body capable of doing these things, but lacks the mental capacity to do them.

43. mapmeld ◴[] No.43469493[source]
No, they're saying that the problems have been reviewed/play-tested by ≥2 humans, so they are not considered unfair or too ambiguous to solve in two attempts (a critique of some ARC-AGI-1 puzzles that o3 missed). They have a lot of puzzles, so they were divided among some number of testers, but I don't think every tester had to try every problem.
44. torginus ◴[] No.43469964{3}[source]
Yeah my analogy for that solution is like claiming to have solved sorting arrays by using enormous compute to try all possible orderings of arrays of length 100.

It's not a real solution because:

- It's way too expensive

- It doesn't scale the way a real solution does
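To make the analogy concrete, here is what that "solution" looks like as code: enumerate all n! orderings until one is sorted. It is correct on tiny inputs and hopeless at n = 100, where n! has about 158 digits (illustrative sketch):

```python
from itertools import permutations

def brute_force_sort(xs):
    # Try every ordering until one happens to be sorted.
    # Correct, but O(n!) -- it cannot scale the way a real sort does.
    for perm in permutations(xs):
        if all(perm[i] <= perm[i + 1] for i in range(len(perm) - 1)):
            return list(perm)

print(brute_force_sort([3, 1, 2]))  # [1, 2, 3] -- fine at n=3, unusable at n=100
```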

45. torginus ◴[] No.43469981{3}[source]
This isn't a novel idea - some people tried the exact same thing the day GPT-4 came out.

And going back even further, there's Goal-Oriented Action Planning - an old-timey video game AI technique that's basically searching through solution space to construct a plan:

https://medium.com/@vedantchaudhari/goal-oriented-action-pla...

(besides the fact that almost all old-timey AI is state-space search)
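For readers unfamiliar with GOAP: it is essentially a search over world states, where each action has preconditions and effects, and the planner chains actions until the goal holds. A toy breadth-first sketch (the actions and facts are invented for illustration):

```python
from collections import deque

# Each action: (name, preconditions, effects) over a set of boolean facts.
ACTIONS = [
    ("get_axe",   frozenset(),             frozenset({"has_axe"})),
    ("chop_wood", frozenset({"has_axe"}),  frozenset({"has_wood"})),
    ("make_fire", frozenset({"has_wood"}), frozenset({"has_fire"})),
]

def plan(start: frozenset, goal: str):
    # BFS over states: apply any action whose preconditions hold,
    # until some state contains the goal fact.
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, steps = queue.popleft()
        if goal in state:
            return steps
        for name, pre, eff in ACTIONS:
            if pre <= state:
                nxt = state | eff
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))

print(plan(frozenset(), "has_fire"))  # ['get_axe', 'chop_wood', 'make_fire']
```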

replies(1): >>43480778 #
46. CooCooCaCha ◴[] No.43471812[source]
The statement you quoted is a general statement, not specific to ARC-AGI.

The scenarios you listed are examples of what they’re talking about. Those are tasks that humans can easily do but robots have a hard time with.

47. mchusma ◴[] No.43476270{6}[source]
Genuine question: do you feel Waymo is not L5 self-driving? I think Waymo has L5, but it's not truly economical yet.
48. littlestymaar ◴[] No.43480778{4}[source]
What's new is applying that to LLMs.

> This isn't a novel idea - some people tried the exact same thing the day GPT4 came out.

What do you mean? Since GPT-4's weights aren't available, you can't run RL on it by yourself. Only OpenAI can.