
Getting 50% (SoTA) on Arc-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points tomduncalf | 4 comments
mikeknoop ◴[] No.40712282[source]
(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop, using 4o to sample reasoning traces/programs conditioned on the training examples and the test input. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
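
For intuition, here's a minimal sketch of that sample-and-filter loop in Python. This is not Ryan's code: llm_generate_program and run_program are hypothetical stand-ins for the GPT-4o prompting step and a sandboxed executor.

    def llm_generate_program(train_pairs):
        # Placeholder: the real pipeline prompts GPT-4o with the demonstration
        # pairs and asks for Python source implementing the transformation.
        return "def transform(grid):\n    return grid"  # trivial echo candidate

    def run_program(source, grid):
        # Execute a candidate program in an isolated namespace, apply transform().
        namespace = {}
        try:
            exec(source, namespace)
            return namespace["transform"](grid)
        except Exception:
            return None

    def solve_arc_task(task, n_samples=8000):
        # Sample many candidates, keep those that reproduce every demonstration
        # pair exactly, then apply one survivor to the held-out test input(s).
        survivors = []
        for _ in range(n_samples):
            source = llm_generate_program(task["train"])
            if all(run_program(source, ex["input"]) == ex["output"]
                   for ex in task["train"]):
                survivors.append(source)
        if not survivors:
            return None  # abstain / fall back when nothing fits all examples
        return [run_program(survivors[0], t["input"]) for t in task["test"]]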

A couple important notes:

1. This result is on the public eval set vs. the private set (which the ARC Prize $ is based on).

2. The current private-set SOTA (~35%) solution also scored ~50% on the public set, so this new result might be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

EDIT: also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. we hope to inspire more frontier AI research sharing like this

refreshingdrink ◴[] No.40714116[source]
Also worth noting that Ryan mentions

> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set

and

> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing

It’s not unfortunate: generalizing beyond the training distribution is a crucial part of intelligence that ARC is trying to measure! Among other reasons, developing with test-set data is bad practice in ML because it hides the difficulty of this challenge. Even worse, writing about a bunch of tricks that help results on this subset extends the test-set leakage to the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set.

rfoo ◴[] No.40715655[source]
... and we know that if we really want to nail it, we'd better just pay someone to create 1,000,000 more (and harder) problems for training (without looking at any in the test set, of course), i.e. make the training set distribution similar to the test set again.

Because the thing we have now is data-hungry. Your brain is pre-trained on other similar challenges as well. What's the point of requiring it to "generalize beyond the training distribution" with so few samples?

Really, I thought LLMs ended this "can we pretrain on in-house prepared private data for ILSVRC" flame war already.

advael ◴[] No.40715850[source]
You seem to misunderstand why generalization is important for making claims about intelligent systems. To illustrate this, we could really easily design a system that encodes all the test set questions and their answers, puts them in an enormous hash table, and looks up the correct answer to each challenge when presented with it. This could probably score 100% on ARC if given the entire test set. Would you call this AGI? What if I put it through a transformer as a hashing function?
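
To make the thought experiment concrete, here's a toy version of that lookup-table "solver" (purely illustrative; MemorizingSolver is a made-up name, not anyone's actual system):

    class MemorizingSolver:
        # Memorizes (input -> output) pairs from a leaked test set and looks them
        # up verbatim; it generalizes to nothing outside that table.
        def __init__(self, leaked_pairs):
            self.table = {repr(inp): out for inp, out in leaked_pairs}

        def solve(self, grid):
            return self.table.get(repr(grid))  # None for anything unseen

    solver = MemorizingSolver([([[1, 0], [0, 1]], [[0, 1], [1, 0]])])
    print(solver.solve([[1, 0], [0, 1]]))  # "perfect" on the leaked example
    print(solver.solve([[2, 2], [2, 2]]))  # None: zero generalization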

The mainstream attention LLMs have garnered has added a bunch of noise to the way we talk about machine learning systems, and unfortunately the companies releasing them are partially to blame for this. That doesn't mean we should change the definition of success for various benchmarks to better suit lay misunderstandings of how this all works

rfoo ◴[] No.40717004[source]
First, LLMs are not AGI. Never will be. Can we talk now?

> if given the entire test set.

I don't want the entire test set. Or any single one in the test set.

The problem here is that the ARC challenge deliberately gives a training set with a different distribution than both the public and the private test sets. It's like having only 1+1=2, 3+5=8, 9+9=18 in the training set and then 1+9=10, 5*5=25, 16/2=8, (0!+0!+0!+0!)!=24 in the test set.

I can see the argument of "giving the easy problems as a demonstration of the rules, and then with 'intelligence' [1] you should be able to get the harder ones (i.e. a different distribution)", but I don't believe it's a good way to benchmark current methods, mainly because there are shortcuts. Like, I can teach my kids how factorials work and that ! means factorial, instead of teaching them only how addition works and making them figure out how multiplication, division and factorials work and what the notation is.

[1] Whatever that means.

1. astromaniak ◴[] No.40720676[source]
> First, LLMs are not AGI.

It's the most general thing we have right now, right?

> Never will be.

If there is no other breakthrough anytime soon, we can engineer AGI-like things around LLMs. I mean an LLM trained to use different attachments, which can be other models and algorithms; examples would be image recognition models and databases of algorithms. Even now ChatGPT can use Bing search and a Python interpreter. The first steps are done, others will follow. The result will not be true AGI, but still a very capable system. And there is another factor: next models can be trained on high-quality data generated by current models instead of random internet garbage. This should improve their spatial and logical abilities.

2. advael ◴[] No.40725199[source]
The balance of evidence seems to suggest that training on model outputs leads to a significant degradation in model accuracy; some have even called this "model collapse." I'm not going to say it's impossible that this situation could improve, but your intuition that high-quality generated outputs are an obvious means to bootstrap the next leap forward in model quality is definitely a contrarian position, as I understand it.
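
For a cartoon of the failure mode (nothing to do with LLM specifics, just the fit-on-your-own-samples loop): repeatedly refitting a simple Gaussian to samples drawn from the previous generation's fit tends to erode the tails of the original distribution.

    import random
    import statistics

    random.seed(0)
    real_data = [random.gauss(0.0, 1.0) for _ in range(50)]
    mu, sigma = statistics.fmean(real_data), statistics.pstdev(real_data)

    for generation in range(1, 21):
        # each "model" is trained only on the previous generation's outputs
        synthetic = [random.gauss(mu, sigma) for _ in range(50)]
        mu, sigma = statistics.fmean(synthetic), statistics.pstdev(synthetic)
        print(f"generation {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    # the fitted std drifts and tends to shrink across generations: diversity erodes
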
3. astromaniak ◴[] No.40747829[source]
There was a report about training a small model on a dataset generated entirely by GPT-4: small stories using kids' vocabulary, probably half a year to a year back. So it's possible. My idea was to mix in a significant portion of generated texts on logic and reasoning, which would be expensive to create using humans but much cheaper using GPT. But you are right, the process is unstable and will (likely) collapse without some extra effort. Mixing is one way; that would make it still respond correctly on the original data.
4. advael ◴[] No.40757547{3}[source]
There's a world of difference between machine-teaching approaches that can create a less complex model from a more capable one and bootstrapping a more capable model from synthetic data. And don't get me wrong, it's still very useful to be able to distill models in this way! In many cases it's low-hanging fruit for optimizing the parameter count or other resource bottlenecks of the models in question; maybe the original learned representation wasn't the simplest neural network that could approximate the same function to the tolerance we care about. This streamlining can sometimes even induce distillation of certain abstractions, which I think has been used best in motion-transfer results like Nvidia's MegaPortraits or, more recently, Alibaba's EMO. However, if there's a scale-oriented path to language models that generalize better (or are more controllable, or just do better on established benchmarks) that is currently bottlenecked by available data, it seems unlikely that relying on synthetic data from extant models will get it over that hurdle. That should roughly match your intuition if you're familiar with the information theory underlying statistical models, which neural networks of any kind are:

A model's predictions are necessarily a compression of the data available to it, so the hypothetical information-theoretic best case is that a model's outputs (or those of models trained in a similar way on similar volumes of data) are diverse enough that a new model trained on them merely replicates the original's performance. In practice even that tends not to happen. Curation of available data can produce models with more focused distributions within the space of models we can feasibly train with the data and resources available, and you can use ensemble learning techniques or, I guess, stuff like RLHF (which is kind of a silly framing of that concept, as some RL people have pointed out, but it's the one people are familiar with now), but all of this is essentially just moving around on a Pareto front that may not contain any "strictly better" model for whatever criteria we care about.
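
For reference, the distillation direction mentioned a couple of paragraphs up usually looks something like this standard Hinton-style objective, sketched here with PyTorch (an illustrative sketch, not tied to any specific result discussed in this thread):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Push the student's softened predictions toward the teacher's:
        # KL(teacher || student) on temperature-scaled distributions.
        t = temperature
        soft_teacher = F.softmax(teacher_logits / t, dim=-1)
        log_student = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

    teacher_logits = torch.randn(4, 10)                      # larger, capable model
    student_logits = torch.randn(4, 10, requires_grad=True)  # smaller student
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()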

I think the scaling laws of these things are running up against some fundamental limits in terms of the useful diversity of available data and the computational feasibility of meaningful improvements in scale. While hype likes to pretend that anything that happens fast for a while is "exponential", there are lots of other families of functions that appear to shoot upward before plateauing at some fundamental limit, like a sigmoid! To me, it makes more intuitive sense that the capacity of a given model family will hit a plateau than that it will continue scaling indefinitely, especially as we start to run up against dataset limits, and I'd be shocked if there's meaningfully more data out there than the major tech companies have already gotten their hands on.

That's not to say impressive results aren't still happening; they're just mostly tackling different problems: various modality transfers, distillation-like improvements that make extant capability sets cheaper (in computational terms) to run, superficial capability shifts that better refine a language model to serve a particular use case, etc. LLMs in their current form probably need another significant qualitative breakthrough to overcome their fundamental problems. They're clearly quite useful to a lot of people in their current form. They just don't live up to all the hype that's flying around.