
188 points gkamradt | 2 comments
danpalmer No.43466879
> and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization

This is self-referential: the benchmark pinpointed the moment when AI went from memorization to problem solving, because the benchmark requires problem solving to complete. How do we know it requires problem-solving skills? Because memorization-only LLMs can't do it but humans can.

I think ARC are producing some great benchmarks, and they probably are pushing forward the state of the art. However, I don't think they identified anything in particular with o3; at least, they don't seem to have proven a step change.

replies(1): >>43466922 #
fchollet No.43466922
These tasks require fluid intelligence because they were designed that way, with task uniqueness/novelty as the primary goal.

ARC 1 was released long before in-context learning was identified in LLMs (and designed before Transformer-based LLMs existed), so the fact that LLMs can't do ARC was never a design consideration. It just turned out this way, which confirmed our initial assumption.
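
(For readers unfamiliar with the format: each ARC task is a small JSON file containing a few demonstration input/output grid pairs plus held-out test inputs, and every task uses a different transformation. A minimal loading sketch in Python, with an illustrative file path, following the public fchollet/ARC JSON structure:)

    import json

    # Each ARC task file holds "train" demonstration pairs and "test" pairs.
    # Grids are lists of lists of ints 0-9 (colors). The path is illustrative.
    with open("data/training/some_task.json") as f:
        task = json.load(f)

    for pair in task["train"]:
        grid_in, grid_out = pair["input"], pair["output"]
        # A solver must infer the transformation from these few pairs alone.
        print(f"{len(grid_in)}x{len(grid_in[0])} -> {len(grid_out)}x{len(grid_out[0])}")

    test_input = task["test"][0]["input"]  # the grid the solver must transform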

replies(2): >>43467080 #>>43467479 #
YeGoblynQueenne No.43467080
>> The reason these tasks require fluid intelligence is because they were designed this way -- with task uniqueness/novelty as the primary goal.

That's no different from claiming that LLMs understand language, or reason, etc., because they were designed that way.

Neural nets of all sorts have been beating benchmarks since forever. For example, there are tons of language-understanding benchmarks, pretty much all saturated by now (GLUE, SuperGLUE, ULTRASUPERAWESOMEGLUE... OK, I made that last one up), but passing them means nothing about the ability of neural-net-based systems to understand language, regardless of how much their authors designed them to test language understanding.

Failing a benchmark also doesn't mean anything. A few years ago, at the first ARC Kaggle competition, the entries were ad hoc and amateurish. The first time a well-resourced team (OpenAI) tried ARC, they ran roughshod over it, and now you have to make a new one.

At some point you have to face the music: ARC is just another benchmark, destined to be beaten in good time once anyone makes a concentrated effort at it, and it will still prove nothing about intelligence, natural or artificial.

replies(2): >>43467150 #>>43467170 #
fchollet No.43467150
The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.

By the time OpenAI attempted ARC in 2024, a colossal amount of resources had already been expended trying to beat the benchmark. The OpenAI run itself cost several million dollars in inference compute alone.

ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before. o3 is a case of a good approach meeting an appropriate benchmark, rather than an effort to beat ARC specifically.

replies(1): >>43475924 #
YeGoblynQueenne No.43475924
>> The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.

Which top lab was that? What did they try?

>> ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before.

Unfortunately, observations support a simpler hypothesis: o3 was trained on enough data about ARC-1 to solve it well. There is currently insufficient data on ARC-2, so o3 can't solve it. No super-magickal, mysterious abilities qualitatively different from all models that came before are required whatsoever.

Indeed, that is a common pattern in machine learning research: newer models perform better on benchmarks than earlier models not because their capabilities have increased, but because they're bigger models trained on more data with more compute. They're just bigger, slower, and more expensive, and just as dumb as their predecessors.

That's 90% of deep learning research in a nutshell.

replies(1): >>43479221 #
bubblyworld No.43479221
I'm sorry, but what observations support that hypothesis? There were scores of teams trying exactly that, training LLMs directly on ARC-AGI data, and by and large they achieved mediocre results. It just isn't an approach that works for this problem set.
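
(For context, "training LLMs directly on ARC-AGI data" typically meant serializing the grids to text and fine-tuning on the demonstration pairs. A rough Python sketch of that kind of encoding, using an illustrative row-per-line scheme rather than any particular team's actual format:)

    def grid_to_text(grid):
        # Render a grid of ints 0-9 as newline-separated digit rows.
        return "\n".join("".join(str(cell) for cell in row) for row in grid)

    def task_to_example(task):
        # Build one fine-tuning example: demo pairs as the prompt,
        # the test output grid as the target completion.
        parts = []
        for pair in task["train"]:
            parts.append(f"INPUT:\n{grid_to_text(pair['input'])}\n"
                         f"OUTPUT:\n{grid_to_text(pair['output'])}")
        test = task["test"][0]
        prompt = ("\n\n".join(parts)
                  + f"\n\nINPUT:\n{grid_to_text(test['input'])}\nOUTPUT:\n")
        target = grid_to_text(test["output"])
        return prompt, target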

To be honest, your argument sounds like an attempt to motivate a predetermined conclusion.

replies(1): >>43498398 #
YeGoblynQueenne No.43498398
In which case, what is the point of your comment? I mean, what do you expect me to do after reading it, reach a different predetermined conclusion?
replies(1): >>43501898 #
bubblyworld No.43501898
Provide some evidence for your claims? This empty rhetoric stuff in every AI thread on HN wears me out a bit. I apologise for being a little aggressive in my previous comment.