188 points gkamradt | 13 comments
gkamradt ◴[] No.43465162[source]
Hey HN, Greg from ARC Prize Foundation here.

Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.

In Dec ‘24, ARC-AGI-1 (launched in 2019) pinpointed the moment AI moved beyond pure memorization, as demonstrated by OpenAI's o3.

ARC-AGI-2 targets test-time reasoning.

My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.

Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are <4%.

Every ARC-AGI-2 task (100% of them), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.

Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.

Change log from ARC-AGI-1 to ARC-AGI-2:

* The two main evaluation sets (semi-private, private eval) have increased to 120 tasks
* Solving tasks requires more reasoning vs pure intuition
* Each task has been confirmed to have been solved by at least 2 people (many more) out of an average of 7 test takers, in 2 attempts or less
* Non-training task sets are now difficulty-calibrated
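
To make the “2 attempts or less” scoring concrete: each ARC task is a small JSON file of train/test grid pairs, and a task only counts as solved if every test output grid is reproduced exactly within two attempts. A rough illustrative sketch follows (assuming the public ARC-AGI-1 JSON layout carries over to ARC-AGI-2; solve() and task.json are placeholders, not the official Kaggle harness):

    import json

    def solve(train_pairs, test_input):
        # Placeholder "solver": a real entry would infer the grid
        # transformation from train_pairs; here we just echo the input.
        return [test_input, test_input]

    def score_task(task):
        # A task counts as solved only if every test output grid is
        # matched exactly by one of (at most) 2 attempts.
        for pair in task["test"]:
            attempts = solve(task["train"], pair["input"])[:2]
            if not any(a == pair["output"] for a in attempts):
                return 0
        return 1

    with open("task.json") as f:
        # one task: {"train": [{"input": grid, "output": grid}, ...], "test": [...]}
        task = json.load(f)
    print(score_task(task))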

The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) had 1.5K teams participate and produced 40+ research papers.

The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition

We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.

Happy to answer questions.

replies(13): >>43465254 #>>43466394 #>>43466647 #>>43467579 #>>43467810 #>>43468015 #>>43468067 #>>43468081 #>>43468268 #>>43468318 #>>43468455 #>>43468706 #>>43468931 #
1. Chathamization ◴[] No.43468268[source]
> Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI.

I don’t think that follows. Just because people fail to create ARC-AGI problems that are difficult for an AI to solve doesn’t mean that said AI can just be plugged into a humanoid robot and it will now reliably cook dinner, order a pizza and drive to pick it up, take a bus downtown to busk on the street and bring the money back home, etc.

ARC-AGI is an interesting benchmark, but it’s extremely presumptive to think that these types of tests are going to demonstrate AGI.

replies(5): >>43468308 #>>43468359 #>>43468692 #>>43468875 #>>43471812 #
2. Palmik ◴[] No.43468308[source]
In your example you already indicated two tasks that you think might be hard for AI but easy for humans.

Who said that cooking dinner couldn't be part of ARC-AGI-<N>?

replies(1): >>43468338 #
3. Chathamization ◴[] No.43468338[source]
That’s precisely what I meant in my comment by “these types of tests.” People are eventually going to have some sort of standard for what they consider AGI. But that doesn’t mean the current benchmarks are useful for this task at all, and saying that the benchmarks could be completely different in the future only underscores this.
replies(1): >>43468361 #
4. yorwba ◴[] No.43468359[source]
The point isn't demonstrating AGI, but rather demonstrating that AGI definitely hasn't been reached yet.
5. pillefitz ◴[] No.43468361{3}[source]
They are useful for reaching ARC-AGI-<N+1>.
replies(1): >>43468411 #
6. Chathamization ◴[] No.43468411{4}[source]
How is any of this a useful path toward getting an AI to cook dinner?

We already know of many tasks that most humans can do relatively easily, yet most people don’t expect AI to be able to do them for years to come (for instance, L5 self-driving). ARC-AGI appears to be going in the opposite direction: can these models pass tests that are difficult for the average person to pass?

These benchmarks are interesting in that they show increasing capabilities of the models. But they seem to be far less useful at determining AGI than the simple benchmarks we’ve had all along (can these models do everyday tasks that a human can do?).

replies(2): >>43469091 #>>43476270 #
7. ◴[] No.43468692[source]
8. jononor ◴[] No.43468875[source]
The tasks you mention require intelligence but also a robot body with a lot of physical dexterity suited to a designed-for-humanoids world. That seems like an additional requirement on top of intelligence? Maybe we do not want an AGI definition to include that?

There are humans who cannot perform these tasks, at least without assistive/adapted systems such as a wheelchair and accessible bus.

replies(2): >>43469358 #>>43469425 #
9. fastball ◴[] No.43469091{5}[source]
The "everyday tasks" you specifically mention involve motor skills that are not useful for measuring intelligence.
10. aziaziazi ◴[] No.43469358[source]
I read that as “humans can perform these tasks, at least with…”

Put the computer in a wheelchair of its choice and let it try to catch the bus. How would you compare program and human reasoning abilities while disregarding the human’s ability to interact with the outside world?

Edit: ARC-AGI itself is only approachable by humans with functional vision and motor control; others need assistive devices.

11. Chathamization ◴[] No.43469425[source]
> at least without assistive/adapted systems such as a wheelchair and accessible bus.

Which is precisely what the robotic body I mentioned would be.

You're talking about humans who have the mental capacity to do these things, but who don't control a body capable of doing them. That's the exact opposite of an AI that controls a body capable of doing these things, but lacks the mental capacity to do them.

12. CooCooCaCha ◴[] No.43471812[source]
The statement you quoted is a general statement, not specific to ARC-AGI.

The scenarios you listed are examples of what they’re talking about. Those are tasks that humans can easily do but robots have a hard time with.

13. mchusma ◴[] No.43476270{5}[source]
Genuine question: do you feel Waymo is not L5 self-driving? I think Waymo has L5, but it’s not yet economically viable.