
Getting 50% (SoTA) on ARC-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points | tomduncalf
mikeknoop ◴[] No.40712282[source]
(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop, using 4o to sample reasoning traces/programs conditioned on the training examples and then testing them. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
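A minimal sketch of that outer loop, under some illustrative assumptions (the sample_program helper standing in for the GPT-4o call, and the task dict layout, are invented here; this is not Ryan's actual code):

    def sample_program(train_examples):
        """Hypothetical: prompt GPT-4o with the train examples and return
        Python source defining transform(grid) -> grid."""
        raise NotImplementedError  # stands in for an LLM API call

    def solve_task(task, n_samples=8000):
        """Sample candidate programs; submit one that fits every train example."""
        for _ in range(n_samples):
            src = sample_program(task["train"])
            try:
                ns = {}
                exec(src, ns)                    # candidate defines transform()
                transform = ns["transform"]
                # Keep only programs that reproduce all training outputs.
                if all(transform(ex["input"]) == ex["output"]
                       for ex in task["train"]):
                    return [transform(t["input"]) for t in task["test"]]
            except Exception:
                pass                             # discard crashing/wrong candidates
        return None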

A couple important notes:

1. This result is on the public eval set, not the private set (which is what the ARC Prize $ is awarded on).

2. The current private-set SOTA (~35%) solution also scored ~50% on the public set, so this new result might be SOTA, but it hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

EDIT: Also, congrats and kudos to Ryan for achieving this and for putting in the effort to document and share his approach. We hope to inspire more frontier AI research sharing like this.

replies(11): >>40712673 #>>40712907 #>>40713440 #>>40714116 #>>40714245 #>>40714428 #>>40715353 #>>40715468 #>>40715482 #>>40716604 #>>40718028 #
refreshingdrink ◴[] No.40714116[source]
Also worth noting that Ryan mentions

> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set

and

> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing

It’s not unfortunate: generalizing beyond the training distribution is a crucial part of the intelligence ARC is trying to measure! Among other reasons, developing against test-set data is bad practice in ML because it hides the true difficulty of this challenge. Even worse, writing up a bunch of tricks that help results on this subset extends the test-set leakage to the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set.

replies(1): >>40715655 #
rfoo ◴[] No.40715655[source]
... and we know that if we really want to nail it, we'd better just pay someone else to create 1,000,000 more (and harder) problems for training (without looking at any in the test set, of course), i.e. make the training set distribution similar to the test set again.

Because the thing we have now is data-hungry. Your brain is pre-trained on other similar challenges as well. What's the point of requiring it to "generalize beyond the training distribution" with so few samples?

Really, I thought LLMs ended this "can we pretrain on in-house prepared private data for ILSVRC" flame war already.

replies(2): >>40715788 #>>40715850 #
advael ◴[] No.40715850[source]
You seem to misunderstand why generalization is important for making claims about intelligent systems. To illustrate this, we could really easily design a system that encodes all the test set questions and their answers, puts them in an enormous hash table, and looks up the correct answer to each challenge when presented with it. This could probably score 100% on ARC if given the entire test set. Would you call this AGI? What if I put it through a transformer as a hashing function?
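A sketch of that degenerate "solver" (the task format and the JSON serialization are illustrative assumptions):

    import json

    class MemorizingSolver:
        """Scores 100% on any benchmark whose test set it has already seen,
        while generalizing to nothing."""
        def __init__(self, test_set):
            # "Training" is just storing every test input/output pair verbatim.
            self.table = {json.dumps(t["input"]): t["output"] for t in test_set}

        def solve(self, grid):
            # Perfect recall on seen inputs, useless on anything new.
            return self.table.get(json.dumps(grid))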

The mainstream attention LLMs have garnered has added a bunch of noise to the way we talk about machine learning systems, and unfortunately the companies releasing them are partially to blame for this. That doesn't mean we should change the definition of success for various benchmarks to better suit lay misunderstandings of how this all works.

replies(1): >>40717004 #
rfoo ◴[] No.40717004[source]
First, LLMs are not AGI. Never will be. Can we talk now?

> if given the entire test set.

I don't want the entire test set. Or any single problem from it.

The problem here is that the ARC challenge deliberately gives a training set with a different distribution than both the public and the private test sets. It's like having only 1+1=2, 3+5=8, 9+9=18 in the training set and then 1+9=10, 5*5=25, 16/2=8, (0!+0!+0!+0!)!=24 in the test set.
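To make the analogy concrete, a toy check (invented strings, not ARC data) of how the test set demands operations the training set never exhibits:

    train_set = ["1+1=2", "3+5=8", "9+9=18"]
    test_set = ["1+9=10", "5*5=25", "16/2=8", "(0!+0!+0!+0!)!=24"]

    ops = lambda problems: {c for p in problems for c in p if c in "+-*/!"}
    print(ops(test_set) - ops(train_set))  # {'*', '/', '!'}: never seen in training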

I can see the argument of "give the easy problems as a demonstration of the rules, and then with 'intelligence' [1] you should be able to get the harder ones (i.e. a different distribution)", but I don't believe it's a good way to benchmark current methods, mainly because there are shortcuts. Like, I can teach my kids how factorial works and that ! means factorial, instead of teaching them only how addition works and making them figure out how multiplication, division, and factorial work, and what the notation means.

[1] Whatever that means.

replies(3): >>40719825 #>>40720526 #>>40720676 #
blobbers ◴[] No.40720526[source]
The problem is that there's no way to infer the right answer to 0! from the training alone; 0! = 1 is a convention, not something you can derive from addition examples. You need more context to learn it. Humans need more context too. If you put that at the end of every grade 1 math test, no student would get it right unless they had some context.

Do grade 1 kids have AGI? (Haha)

But seriously, all professions need to train in context to solve complex problems. You can train in adjacent realms and reason about problems but to truly perform, you need more training.

A general surgeon might be better than an electrician at veterinary work, but I'd rather have a veterinary surgeon operate on my dog.

So some things are "AGI"-able, and other things need specific training.

replies(1): >>40720905 #
advael ◴[] No.40720905[source]
I think there's variance in people's degree of compositionality, as well as how quickly they can pick up on novel relationships. Testing "intelligence" in humans has always been kind of fraught in the first place, but any capability we may care to measure is going to permit degrees, and there will be some variance in humans on it. We should expect this. There's variance in goddam everything.

We should also expect machine learning systems to have somewhat different properties from human minds. Computers are more likely to accomplish perfect recall, for instance, and we can scale the size of their memory and their processing speed. All these confounding variables can make it hard to construct binary tests of a capability, which is really what ARC seems to be trying to do. One such capability that AI researchers often talk about is conceptual compositionality. People care about compositionality because it's a good way to demonstrate that an abstract model is being used to reason about a situation, and that the model can be applied in unseen but conceptually similar situations. This "generalization" or "abstraction" capability is really the goal, but it's hard to design tests for it directly, and "composition" (that is, taking a situation that's novel, but a straightforward application of two or more different abstractions the agent should already "know") is one more testable way to try to tease it out.
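As a toy illustration of what such a composition probe might look like (these primitives are invented, not actual ARC transformations): an agent shown "mirror" and "recolor" separately should handle a held-out task that requires both at once.

    def mirror(grid):
        """Flip each row left-to-right."""
        return [row[::-1] for row in grid]

    def recolor(grid, mapping):
        """Replace cell values according to mapping."""
        return [[mapping.get(c, c) for c in row] for row in grid]

    # Seen separately in "training"; the composition itself is never shown.
    grid = [[1, 0], [0, 2]]
    print(recolor(mirror(grid), {1: 3, 2: 4}))  # [[0, 3], [4, 0]]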

As you point out, humans often fail this kind of test, and we can rightly claim that in those cases, they didn't correctly grasp the insight we were hoping they had. Testing for distilled abstractions versus memorization or superficial pattern recognition isn't just important to AI research; it's also a key problem in many places in human education.