Getting 50% (SoTA) on ARC-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points by tomduncalf | 4 comments
mikeknoop ◴[] No.40712282[source]
(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop that uses 4o to sample reasoning traces/programs and tests each candidate against the training examples. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
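
To make the loop concrete, here's a minimal sketch of that generate-and-test idea in Python. This is an assumed illustration, not Ryan's actual code: sample_programs_from_gpt4o is a hypothetical stand-in for the GPT-4o prompting step, and in practice the sampled programs are untrusted generated code that would need to run in a sandbox.

    # Minimal sketch of the generate-and-test outer loop (assumed, not Ryan's code).
    from typing import Callable, List, Optional, Tuple

    Grid = List[List[int]]  # an ARC grid: 2D list of color indices 0-9

    def sample_programs_from_gpt4o(pairs: List[Tuple[Grid, Grid]],
                                   n: int = 8000) -> List[str]:
        """Hypothetical stub: prompt GPT-4o with the task's example pairs and
        collect ~n candidate Python sources, each defining transform(grid)."""
        raise NotImplementedError("stand-in for the LLM sampling step")

    def compile_candidate(src: str) -> Optional[Callable[[Grid], Grid]]:
        """Exec one candidate source string and pull out its transform()."""
        ns: dict = {}
        try:
            exec(src, ns)  # untrusted generated code: sandbox this in practice
        except Exception:
            return None
        fn = ns.get("transform")
        return fn if callable(fn) else None

    def solve(pairs: List[Tuple[Grid, Grid]], test_input: Grid) -> Optional[Grid]:
        """Return the output of the first program that reproduces every
        training output exactly; most samples fail, which is expected."""
        for src in sample_programs_from_gpt4o(pairs):
            fn = compile_candidate(src)
            if fn is None:
                continue
            try:
                if all(fn(inp) == out for inp, out in pairs):
                    return fn(test_input)
            except Exception:
                continue  # crashing candidates are simply skipped
        return None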

A couple of important notes:

1. This result is on the public eval set, not the private set used for the ARC Prize ($).

2. The current private-set SOTA solution (~35%) also scored ~50% on the public set, so this new result may be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public-set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open-source reproductions here once available: https://arcprize.org/leaderboard

EDIT: Also, congrats and kudos to Ryan for achieving this and putting in the effort to document and share his approach. We hope to inspire more sharing of frontier AI research like this.

replies(11): >>40712673 #>>40712907 #>>40713440 #>>40714116 #>>40714245 #>>40714428 #>>40715353 #>>40715468 #>>40715482 #>>40716604 #>>40718028 #
YeGoblynQueenne ◴[] No.40715482[source]
Ah, give it a rest. That's not "frontier AI research", nor is it any kind of reasoning. It's the dumbest of the dumb: a generate-and-test approach that spams a fire hose of Python programs until it hits one that works. And still it gets only 50% on the public eval.

How many thousands of Python programs does a human need to solve a single ARC task? That's what you get with reasoning: you don't need oodles of compute and boodles of sampling.

And I'm sorry to be so mean, but ARC is a farce. It's supposed to be a test for AGI, but its only defense against a big-data approach (what Francois calls "memorisation") is that few examples are provided. That doesn't make the tasks hard to solve with memorisation; it just makes it hard for a human researcher to find enough examples to solve them with memorisation. Like almost every other AI-IQ test before it, ARC is testing for the wrong thing, with the wrong assumptions. See the Winograd Schema Challenge (but not yet the Bongard problems).

replies(3): >>40717360 #>>40719608 #>>40720800 #
jononor ◴[] No.40719608[source]
Do you have any suggestions for a better approach to testing artificial intelligence? I mean, one that allows comparing different approaches and serves as a reasonable metric of progress.
replies(1): >>40720015 #
YeGoblynQueenne ◴[] No.40720015[source]
I don't. I'm guessing (and it's nothing but a guess) that for every problem that can be solved with intelligence, there exists a solution that does not require intelligence. I'm guessing, in other words, that intelligence is the ability to come up with solutions to arbitrary problems. If that's true, then there's no way to test for intelligence by looking at the performance of a system on any particular task, or any finite set of tasks, and so there's no way to create a "test for intelligence".

My guess is supported by the observation that, in AI research, every time someone has come up with a plausible test for intelligence, an AI system has eventually passed it, only to make it clear that the test was not really testing intelligence after all (edit: I don't just mean formal tests; e.g. see how chess used to "require intelligence" right up until Deep Blue vs. Kasparov).

Some people see that as "moving the goalposts", and it's certainly frustrating, but the point is that we don't know what intelligence is, exactly, so it's very hard to test for its presence or absence, or to measure it.

My preference would be for everyone in AI research to either stop what they're doing and try to understand what the hell intelligence is in the first place (that is, to create a theory of intelligence so that AI can be a scientific subject again), or to at least admit they're not interested in creating artificial intelligence. I, for example, am not, but all my background is in subjects traditionally labelled "AI", so I have to suck it up, I guess.

replies(1): >>40720977 #
Nimitz14 ◴[] No.40720977[source]
You're basically paraphrasing fchollet's paper on intelligence and what he talked about in his most recent podcast appearance with Dwarkesh.
replies(1): >>40726505 #
YeGoblynQueenne ◴[] No.40726505[source]
I watched the podcast you're referencing on YouTube, but I don't remember Chollet saying anything like what I say above.

Quite the contrary: Chollet seems convinced that a test for artificial intelligence, like an IQ test for AI, can be created. He has not only created one but also organised a Kaggle competition on it, and is now offering a $1 million prize to solve it. So how is anything he says or does compatible with what I say above, that there likely can't be a test for artificial intelligence?

replies(1): >>40760861 #
Nimitz14 ◴[] No.40760861[source]
You clearly didn't listen much, since Chollet's point is exactly that there is no one task that can test for AI. The test he created is supposed to take that into account.
replies(2): >>40761407 #>>40761459 #
YeGoblynQueenne ◴[] No.40761459[source]
First, don't be a jerk. Second, what's your problem? That you think I said "no single task can be used to test for AI"? I initially said:

>> "If that's true then there's no way to test for intelligence by looking at the performance of a system at any particular task, or any finite set of tasks, and so there's no way to create a "test for intelligence"."

Stress on *or any finite set of tasks*.

So, no, I didn't refer to a single task, if that's what you mean. What the hell do you mean and what the hell is your problem? Why is everyone always such a dick in this kind of discussion?

replies(1): >>40792883 #
Nimitz14 ◴[] No.40792883[source]
Sorry. You came across to me as more interested in sharing your opinion than in understanding what other people are saying. That's annoying. Maybe that's on me, though.

OK, you think no finite set of tasks can be used. Chollet is trying anyway. Maybe he is actually dynamically creating new tasks in the private set every time someone evaluates.

My main point was that I still think you're saying very similar things, quoting from the paper I mentioned:

> If a human plays chess at a high level, we can safely assume that this person is intelligent, because we implicitly know that they had to use their general intelligence to acquire this specific skill over their lifetime, which reflects their general ability to acquire many other possible skills in the same way. But the same assumption does not apply to a non-human system that does not arrive at competence the way humans do. If intelligence lies in the process of acquiring skills, then there is no task X such that skill at X demonstrates intelligence, unless X is a meta-task involving skill acquisition across a broad range of tasks.

This to me sounds very similar to what you said:

> I'm guessing in other words that intelligence is the ability to come up with solutions to arbitrary problems.

And it's also what Chollet talked about on the pod.