
Getting 50% (SoTA) on Arc-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points by tomduncalf | 3 comments
mikeknoop
(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop, using 4o to sample reasoning traces/programs from the training examples and test input. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
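
Here's a minimal sketch of that sample-and-filter loop, purely as an illustration (the helper `sample_candidate_program`, which would wrap a GPT-4o completion call, and the exact numbers are my assumptions, not Ryan's actual code):

    # Illustrative sketch of the outer loop: sample candidate programs,
    # keep one that reproduces every training example exactly.
    from typing import Callable, Optional

    Grid = list[list[int]]

    def select_program(
        train_pairs: list[tuple[Grid, Grid]],
        sample_candidate_program: Callable[[], str],  # assumed helper wrapping a GPT-4o call
        n_samples: int = 8000,
    ) -> Optional[Callable[[Grid], Grid]]:
        for _ in range(n_samples):
            source = sample_candidate_program()
            namespace: dict = {}
            try:
                exec(source, namespace)        # candidate is expected to define transform(grid)
                transform = namespace["transform"]
                if all(transform(x) == y for x, y in train_pairs):
                    return transform           # consistent with all training examples
            except Exception:
                continue                       # malformed or failing candidates are discarded
        return None

    # The selected program is then applied to the test input(s) and its output submitted.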

A couple important notes:

1. This result is on the public eval set, not the private set (the one the ARC Prize $ is tied to).

2. The current private-set SOTA (~35%) solution also scored ~50% on the public set, so this new result might be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

EDIT: Also, congrats and kudos to Ryan for achieving this and putting in the effort to document and share his approach. We hope to inspire more frontier AI research sharing like this.

refreshingdrink
Also worth noting that Ryan mentions

> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set

and

> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing

It’s not unfortunate: generalizing beyond the training distribution is a crucial part of the intelligence that ARC is trying to measure! Among other reasons, developing with test-set data is bad practice in ML because it hides the difficulty of this challenge. Even worse, writing about a bunch of tricks that help results on this subset extends the test-set leakage to the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set.
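
The standard alternative, sketched below only as an illustration (the function name and split sizes are mine, not from the post), is to carve a validation split out of the public training set and leave the evaluation sets untouched until the very end:

    # Illustrative only: iterate against a validation split held out from the
    # training tasks; evaluate on the test set once, at the end.
    import random

    def split_train_validation(task_ids: list[str], val_fraction: float = 0.2, seed: int = 0):
        rng = random.Random(seed)
        ids = task_ids[:]
        rng.shuffle(ids)                 # deterministic shuffle for reproducibility
        n_val = int(len(ids) * val_fraction)
        return ids[n_val:], ids[:n_val]  # (training ids, validation ids)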

rfoo
... and we know that if we really want to nail it, we'd better just pay someone else to create 1,000,000 more, harder problems for training (without looking at any in the test set, of course), i.e. make the training-set distribution similar to the test set again.

Because the thing we have now is data-hungry. Your brain is pre-trained on other similar challenges as well. What's the point of requiring it to "generalize beyond the training distribution" with so few samples?

Really, I thought LLMs ended this "can we pretrain on in-house prepared private data for ILSVRC" flame war already.

YeGoblynQueenne
The problem with that is that we know of approaches that can generalise very well from very few examples, even one example, without any kind of pretraining. That requires a good background theory of the target domain (a "world model" in more modern parlance), and we don't know how to automatically generate that kind of theory; only human minds can do it, for now. But given such a theory, the number of examples needed can be as few as one. Clearly, if you can learn from one example but find yourself using thousands, you've taken a wrong turn somewhere.

The concern with the data-hungry approach to machine learning, that at least some of us have, is that it has given up on the effort to figure out how to learn good background theories and turned instead to getting the best performance possible in the dumbest possible way, relying on the largest available amount of examples and compute. That runs against everything else in computer science (and even animal intelligence), where the effort is to make everything smaller, cheaper, faster, smarter: it's putting all the eggs in the basket of making it big, slow and dumb, and hoping that this will somehow solve... intelligence. A very obvious contradiction.

Suppose we lived in a world that didn't have a theory of computational complexity and didn't know that some programs are more expensive to run than others. Would it be the case in that world, that computer scientists competed in solving ever larger instances of the Traveling Salesperson Problem, using ever larger computers, without even trying to find good heuristics exploiting the structure of the problem and simply trying to out-brute-force each other? That world would look a lot like where we are now with statistical machine learning: a pell-mell approach to throwing all resources at a problem that we just don't know how to solve, and don't even know if we can solve.
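
To make the analogy concrete, here's a toy contrast (illustrative code, nothing more): exhaustive search over all tours does factorial work, while a simple nearest-neighbour heuristic exploits the structure of the problem for a tiny fraction of the cost:

    # Toy TSP: brute force enumerates every tour; the nearest-neighbour
    # heuristic just visits the closest unvisited city at each step.
    from itertools import permutations

    def tour_length(tour, dist):
        return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

    def brute_force(dist):
        n = len(dist)
        best = min(permutations(range(1, n)),
                   key=lambda p: tour_length((0,) + p, dist))
        return (0,) + best               # optimal, but factorial-time

    def nearest_neighbour(dist):
        n = len(dist)
        tour, unvisited = [0], set(range(1, n))
        while unvisited:
            nxt = min(unvisited, key=lambda c: dist[tour[-1]][c])
            tour.append(nxt)
            unvisited.remove(nxt)
        return tour                      # usually good, vastly cheaper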

yccs27
Sadly, right now the "throw lots of compute at it in the dumbest possible way" models work, and the "learn good background theories" approaches have gone nowhere. It's Rich Sutton's Bitter Lesson and a lot of us aren't ready to accept it.

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

lesuorac
> that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

Mostly tangential to the article, but I've never really liked this argument. Like, you're playing a game a specific way, somebody else comes in with a new approach and mops the floor with you, and you're going to tell me "they played wrong"? Like no, you were playing wrong the whole time.

entropicdrifter
Yeah, people get salty when their preconceptions are shattered, especially when they've invested a lot of time/energy in thinking built on the assumption that those preconceptions were sound.

It goes beyond simple sunk cost and into the realm of reality slapping them with a harsh "humans aren't special, grow up", which I think is especially bitter for people who aren't already absurdists or nihilists.

YeGoblynQueenne
No, the reason for the disappointment was that early AI pioneers considered chess a model of human intelligence and expected a chess-playing AI to help them understand how human intelligence works. To have computer chess devolve into a race to beat human champions, using techniques that only computers can use, clearly defeated this purpose.

Those "early pioneers" were people like Alan Turing, Claude Shannon, Marvin Minsky, Donald Michie and John McCarthy, all of whom were chess players themselves and were prone to thinking of computer chess as a window into the inner workings of the human mind. Here's what McCarthy had to say when Deep Blue beat Kasparov:

> In 1965 the Russian mathematician Alexander Kronrod said, "Chess is the Drosophila of artificial intelligence." However, computer chess has developed much as genetics might have if the geneticists had concentrated their efforts starting in 1910 on breeding racing Drosophila. We would have some science, but mainly we would have very fast fruit flies.

> Three features of human chess play are required by computer programs when they face harder problems than chess. Two of them were used by early chess programs but were abandoned in substituting computer power for thought.

http://www-formal.stanford.edu/jmc/newborn/newborn.html

Then he goes on to discuss those three features of human chess play. It doesn't really matter what they are; it's clear that he is not complaining about anyone "playing wrong". He's complaining about computer chess taking a direction that fails to contribute to a scientific understanding of human, and I would also say machine, intelligence.