188 points gkamradt | 12 comments
1. danpalmer No.43466879
> and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization

This is self-referential: the benchmark pinpointed the moment when AI went from memorization to problem solving because the benchmark requires problem solving to complete. How do we know it requires problem-solving skills? Because memorization-only LLMs can't do it but humans can.

I think ARC is producing some great benchmarks, and they probably are pushing forward the state of the art. However, I don't think they identified anything in particular with o3; at least, they don't seem to have proven a step change.

replies(1): >>43466922 #
2. fchollet No.43466922
The reason these tasks require fluid intelligence is because they were designed this way -- with task uniqueness/novelty as the primary goal.

ARC 1 was released long before in-context learning was identified in LLMs (and designed before Transformer-based LLMs existed), so the fact that LLMs can't do ARC was never a design consideration. It just turned out this way, which confirmed our initial assumption.

replies(2): >>43467080 #>>43467479 #
3. YeGoblynQueenne No.43467080
>> The reason these tasks require fluid intelligence is because they were designed this way -- with task uniqueness/novelty as the primary goal.

That's no different from claiming that LLMs understand language, or reason, etc., because they were designed that way.

Neural nets of all sorts have been beating benchmarks since forever. For example, there's a ton of language-understanding benchmarks, pretty much all saturated by now (GLUE, SuperGLUE, ULTRASUPERAWESOMEGLUE... OK, I made that last one up), but passing them means nothing about the ability of neural net-based systems to understand language, regardless of how much their authors designed them to test language understanding.

Failing a benchmark also doesn't mean anything. A few years ago, at the first ARC Kaggle competition, the entries were ad hoc and amateurish. The first time a well-resourced team (OpenAI) tried ARC, they ran roughshod over it, and now you have to make a new one.

At some point you have to face the music: ARC is just another benchmark, destined to be beaten in good time whenever anyone makes a concentrated effort at it, and it will still prove nothing about intelligence, natural or artificial.

replies(2): >>43467150 #>>43467170 #
4. fchollet No.43467150{3}
The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.

By the time OpenAI attempted ARC in 2024, a colossal amount of resources had already been expended trying to beat the benchmark. The OpenAI run itself cost several million in inference compute alone.

ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before. o3 is a case of a good approach meeting an appropriate benchmark, rather than an effort to beat ARC specifically.

replies(1): >>43475924 #
5. szvsw No.43467170{3}
I mostly agree with what you are saying, but…

> passing them means nothing about the ability of neural net-based systems to understand language, regardless of how much their authors designed them to test language understanding.

Does this implicitly suggest that it is impossible to quantitatively assess a system’s ability to understand language? (Using the term “system” in the broadest possible sense)

Not agreeing or disagreeing or asking with skepticism. Genuinely asking what your position is here, since it seems like your comment eventually leads to the conclusion that it is unknowable whether a system external to yourself understands language; or, if it is knowable, then only in a purely qualitative way, or perhaps only by a Stewart-style pornographic-threshold test: you'll know it when you see it.

I don't have any problem if that's your position (it might even be mine!). I'm more or less of the mindset that debating whether artificial systems can have certain labels attached to them revolving around words like "understanding," "cognition," "sentience," etc. is generally unhelpful. It's much more interesting to just talk about what the actual practical capabilities and functionalities of such systems are, in a very concrete, observable, hopefully quantitative sense on the one hand, and how it feels to interact with them in a purely qualitative sense on the other. Benchmarks can be useful in the former but not the latter.

Just curious where you fall. How would you recommend we approach the desire to understand whether such systems can "understand language" or "solve problems," etc.? Or are these questions useless in your view? Or are they only useful insofar as they (the benchmarks/tests, etc.) drive the development of new methodologies/innovations/measurable capabilities, but not in assigning qualitative properties to said systems?

replies(1): >>43475882 #
6. danpalmer No.43467479
Is there any other confirmation of the assumptions, other than the LLM behaviour? Because that still feels like circular reasoning.

I think a similar claim could be levelled against other benchmarks or LLM evaluation tasks. One could say that the Turing test was designed to assess human intelligence, and LLMs pass it; therefore LLMs have human intelligence. This is generally considered false now, because we can plainly see that LLMs do not have intelligence in the same way humans do (yet? debatable, not the point), and instead we concluded that the Turing test was not the right benchmark. That's not to diminish its importance; it was hugely important as a part of AI education and possibly even AI development for decades.

ARC does seem to be pushing the boundaries, I'm just not convinced that it's testing a provable step change.

replies(1): >>43469115 #
7. JFingleton No.43469115{3}
I'm not sure that's quite correct about the Turing test. From Wikipedia:

"Turing did not explicitly state that the Turing test could be used as a measure of "intelligence", or any other human quality. He wanted to provide a clear and understandable alternative to the word "think", which he could then use to reply to criticisms of the possibility of "thinking machines" and to suggest ways that research might move forward."

8. YeGoblynQueenne No.43475882{4}
>> Does this implicitly suggest that it is impossible to quantitatively assess a system’s ability to understand language? (Using the term “system” in the broadest possible sense)

I don't know and I don't have an opinion. I know that tests that claimed to measure language understanding, historically, haven't. There's some literature on the subject if you're curious (sounds like you are). I'd start here:

Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data

Emily M. Bender, Alexander Koller

https://aclanthology.org/2020.acl-main.463/

Quoting the passage that I tend to remember:

>> While large neural LMs may well end up being important components of an eventual full-scale solution to human-analogous NLU, they are not nearly-there solutions to this grand challenge. We argue in this paper that genuine progress in our field — climbing the right hill, not just the hill on whose slope we currently sit — depends on maintaining clarity around big picture notions such as meaning and understanding in task design and reporting of experimental results.

9. YeGoblynQueenne No.43475924{4}
>> The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.

Which top lab was that? What did they try?

>> ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before.

Unfortunately, observations support a simpler hypothesis: o3 was trained on sufficient data about ARC-1 that it could solve it well. There is currently insufficient data on ARC-II, therefore o3 can't solve it. No super magickal and mysterious abilities, qualitatively different from all models that came before, are required whatsoever.

Indeed, that is a common pattern in machine learning research: newer models perform better on benchmarks than earlier models not because their capabilities have increased relative to those earlier models, but because they're bigger models, trained on more data and more compute. They're just bigger, slower, more expensive, and just as dumb as their predecessors.

That's 90% of deep learning research in a nutshell.

replies(1): >>43479221 #
10. bubblyworld No.43479221{5}
I'm sorry, but what observations support that hypothesis? There were scores of teams trying exactly that - training LLMs directly on ARC-AGI data - and by and large they achieved mediocre results. It just isn't an approach that works for this problem set.

To be honest your argument sounds like an attempt to motivate a predetermined conclusion.

replies(1): >>43498398 #
11. YeGoblynQueenne No.43498398{6}
In which case, what is the point of your comment? I mean, what do you expect me to do after reading it: reach a different predetermined conclusion?
replies(1): >>43501898 #
12. bubblyworld No.43501898{7}
Provide some evidence for your claims? This empty rhetoric stuff in every AI thread on HN wears me out a bit. I apologise for being a little aggressive in my previous comment.