Most active commenters
  • advael(6)
  • rfoo(4)

←back to thread

Getting 50% (SoTA) on Arc-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points tomduncalf | 12 comments | | HN request time: 0.001s | source | bottom
Show context
mikeknoop ◴[] No.40712282[source]
(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop and using 4o to sample reasoning traces/programs from training data and test. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.

A couple important notes:

1. this result is on the public eval set vs private set (ARC Prize $).

2. the current private set SOTA ~35% solution also performed ~50% on the public set. so this new result might be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

EDIT: also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. we hope to inspire more frontier AI research sharing like this

replies(11): >>40712673 #>>40712907 #>>40713440 #>>40714116 #>>40714245 #>>40714428 #>>40715353 #>>40715468 #>>40715482 #>>40716604 #>>40718028 #
refreshingdrink ◴[] No.40714116[source]
Also worth nothing that Ryan mentions

> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set

and

> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing

It’s not unfortunate: generalizing beyond the training distribution is a crucial part of intelligence that ARC is trying to measure! Among other reasons, developing with test-set data is a bad practice in ML because it hides the difficulty this challenge. Even worse, writing about a bunch of tricks that help results on this subset is extending the test-set leakage the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set

replies(1): >>40715655 #
rfoo ◴[] No.40715655[source]
... and we know that if we really want to nail it we'd better just pay someone else to create 1,000,000 more harder problems for training (without looking at any in test set, of course). i.e. make the training set distribution similar to test set again.

Because the thing we have now is data-hungry. Your brain is pre-trained on other similar challenges as well. What's the point of requiring it to "generalize beyond the training distribution" with so few samples?

Really, I thought LLMs ended this "can we pretrain on in-house prepared private data for ILSVRC" flame war already.

replies(2): >>40715788 #>>40715850 #
1. advael ◴[] No.40715850[source]
You seem to misunderstand why generalization is important for making claims about intelligent systems. To illustrate this, we could really easily design a system that encodes all the test set questions and their answers, puts them in an enormous hash table, and looks up the correct answer to each challenge when presented with it. This could probably score 100% on ARC if given the entire test set. Would you call this AGI? What if I put it through a transformer as a hashing function?

The mainstream attention LLMs have garnered has added a bunch of noise to the way we talk about machine learning systems, and unfortunately the companies releasing them are partially to blame for this. That doesn't mean we should change the definition of success for various benchmarks to better suit lay misunderstandings of how this all works

replies(1): >>40717004 #
2. rfoo ◴[] No.40717004[source]
First, LLMs are not AGI. Never will be. Can we talk now?

> if given the entire test set.

I don't want the entire test set. Or any single one in the test set.

The problem here is ARC challenge deliberately give a training set with different distribution than both the public and the private test set. It's like having only 1+1=2, 3+5=8, 9+9=18 in training set and then 1+9=10, 5*5=25, 16/2=8, (0!+0!+0!+0!)!=24 in test set.

I can see the argument of "giving the easy problems as demonstration of rules and then with 'intelligence' [1] you should be able to get harder ones (i.e. a different distribution)", but I don't believe it's a good way to benchmark current methods, mainly because there are shortcuts. Like I can teach my kids how factorial works and ! means factorial, instead of teaching them how addition works only and make them figure out how multiplication, division and factorial works and what's the notation.

[1] Whatever that means.

replies(3): >>40719825 #>>40720526 #>>40720676 #
3. advael ◴[] No.40719825[source]
Okay I admit I'm confused and think I probably missed a crucial thing here. You're saying the publicly available problem set isn't indicative of the distribution of the test set? If so, I can see why you object to that. Still, it's potentially possible that the test's intention is to demonstrate something like progressive integration of compositionality given an abstract model. A lot of machine learning systems can do well as long as they've seen an example similar to the problem they've been presented, but can't do things like respond to a situation that presents them with a novel composition of two abstractions they seem to have already learned in the way a human can trivially.

Like only having [1+1=2, 4+5=9, 2+10=12] in the training set and [2*5=10, 3/4=.75, 2^8=256] in the test set would be bad, but something like [1+1=2, 3+4*2=11, 5*3=15, 2*7=14, 1+3/5=1.8, 3^3=27] vs [2+4*3=14, 3+3^2+4=16, 2*3/4+2^3/2^4=2] might not be, depending on what they're trying to test

Compositionality of information, especially of abstractions (like rules or models of a phenomenon), is a key criterion in a lot of people's attempts to operationally define "intelligence" (which I agree is overall a nebulous and overloaded concept, but if we're going to make claims about it we need at least a working definition for any particular test we're doing) I could see that meaning that the test set problems need to be "harder" in the sense that presenting compositions of rules in training doesn't preclude memorizing the combinations. But this is just a guess, I'm not involved in ARC and don't know, obviously*

replies(1): >>40720838 #
4. blobbers ◴[] No.40720526[source]
The problem is there is no way to infer the right answer to 0! given the training. You need more context to learn it. Humans need more context. If you put that at the end of every grade 1 math test no student would get it right unless they had some context.

Do grade 1 kids have AGI? (Haha)

But seriously, all professions need to train in context to solve complex problems. You can train in adjacent realms and reason about problems but to truly perform, you need more training.

A general surgeon might be better than an electrician as a vet, but that I’d rather have a veterinary surgeon operate on my dog.

So some things are “AGI” able and other things need specific training.

replies(1): >>40720905 #
5. astromaniak ◴[] No.40720676[source]
> First, LLMs are not AGI.

It's the most generic thing we have right now, right?

> Never will be.

If there is no other breakthrough anytime soon we can engineer AGI-like things around LLMs. I mean LLM trained to use different attachments. Which can be other models and algorithms. Examples will be image recognition models and databases for algorithms. Even now ChatGPT can use Bing search and Python interpreter. First steps done, others will follow. The result will be not a true AGI, but still a very capable system. And there is another factor. Next models can be trained on high quality data generated by current models. Instead of internet random garbage. This should improve their spacial and logical abilities.

replies(1): >>40725199 #
6. rfoo ◴[] No.40720838{3}[source]
> You're saying the publicly available problem set isn't indicative of the distribution of the test set?

Yes. From https://arcprize.org/guide:

    Please note that the public training set consists of simpler tasks whereas the public evaluation set is roughly the same level of difficulty as the private test set.
    The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems. It's like a tutorial level.
replies(1): >>40720966 #
7. advael ◴[] No.40720905{3}[source]
I think there's variance in people's degree of compositionality, as well as how quickly they can pick up on novel relationships. Testing "intelligence" in humans has always been kind of fraught in the first place, but any capability we may care to measure is going to permit degrees, and there will be some variance in humans on it. We should expect this. There's variance in goddam everything

We should also expect machine learning systems to have somewhat different properties from human minds. Like computers are more likely to accomplish perfect recall, and we can scale the size of their memory and their processing speed. All these confounding variables can make it hard to make binary tests of a capability, which is really what ARC seems like it's trying to do. One such capability that AI researchers will often talk about is conceptual compositionality. People care about compositionality because it's a good way to demonstrate that an abstract model is being used to reason about a situation, which can be used in unseen but perhaps conceptually similar situations. This "generalization" or "abstraction" capability is really the goal, but it's hard to reason about how to test it, and "composition" (That is, taking a situation that's novel, but a straightforward application of two or more different abstractions the agent should already "know") is one more testable way to try to tease it out.

As you point out, humans often fail this kind of test, and we can rightly claim that in those cases, they didn't correctly grasp the insight we were hoping they had. Testing distilled abstractions versus memorization or superficial pattern recognition isn't just important to AI research, it's also a key problem in lots of places in human education

8. advael ◴[] No.40720966{4}[source]
Well, in this paragraph they seem to explain that their public evaluation set is meant to be indicative of the kind of jump in difficulty you can expect from the private test set. This to me implies that my guess is close: They're looking for models that can learn simple concepts and apply them to complex problems. Keeping the test set private seems to be an attempt at making it difficult to "cheat" at this by simply memorizing superficial details of the more complex problem set, which makes sense given that the whole point of this seems to be testing for systems that can use learned abstractions to tackle novel, out-of-distribution problems

Like with our toy "algebra" examples, sure there's a lot of emphasis on repetition and rote in primary education on these subjects, and that's one way to get people more consistent at getting the calculations right, but to be frank I don't think it's the best way, or as crucial as it's made out to be. What someone really needs to understand about algebra is how the notation works and what the symbols mean. Like I can't unsee the concept of "+" as a function that takes two operands and starts counting for as many steps as one would in the right operand, starting at the value of the left operand. When looking at algebra, the process I go through relies on a bunch of conceptual frameworks, like "Anything in the set of all arabic numerals can be considered a literal value". "Anything in the roman alphabet is likely a variable". "Any symbol is likely an infix operator, that is, a function whose operands are on either side of it". Some of the concepts I'm using are just notational convention. At some point I memorized the set of arabic numerals, what they look like, what each of them means, how they're generally written in relation to each other to express quantities combinatorically. Some of the concepts are logical relations about quantities, or definitions of functions. But crucially, the form of these distillations makes them composable. If I didn't really understand what "+" does, then maybe someone could give me some really bad homework that goes

1 + 30 = 31

20 + 7 = 27

3 + 10 = 13

And then present me the problem

20 + 10 + 3 = ?

And I'd think the answer is

20 + 10 + 3 = 213

That demonstrates some model of how to do these calculations, but it doesn't really capture all the important relationships the symbols represent

We can have any number of objections to this training set. Like I wasn't presented with any examples of adding two-digit numbers together! OR even any examples where I needed to combine numbers in the same rank!

Definitely all true. Probably mistakes we could make in educating a kid on algebraic notation too. It's really hard to do these things in a way that's both accomplishing the goal and testable, quantifiable. But many humans demonstrate the ability to distill conceptual understanding of concepts without exhaustive examples of their properties, so that's one of the things ARC seems to want to test. It's hard to get this perfectly right, but it's a reasonable thing to want

replies(1): >>40725300 #
9. advael ◴[] No.40725199{3}[source]
The balance of evidence seems to suggest that training on model outputs leads to a significant degradation in model accuracy, some have even called this "model collapse." I'm not going to say it's impossible that this situation can improve, but your intuition that high quality generated outputs are an obvious means to bootstrap the next leap forward in model quality is definitely a contrarian position as I understand it
replies(1): >>40747829 #
10. rfoo ◴[] No.40725300{5}[source]
I agree. However, it is not a clear cut what's fair and what is "gaming the benchmark" in this setup, for example:

- can I train on my own private training set (which is harder)?

- can I pretrain on The Pile or something similar, a dataset full of texts crawled from web?

- can I pretrain on elementary school textbooks?

It seems like the latter two is acceptable given the use of GPT-4o here. But then, are the latter two that different to the first one? GPT-4o have the public test set in its training data (GPT-4o is definitely trained on public GitHub repos).

What's the point of having a training set with different distribution in this case, other than making participating harder? Maybe it's to discourage data-hungry approaches, but if there are legit shortcuts, anyone who seriously want to win would take it.

11. astromaniak ◴[] No.40747829{4}[source]
There was a report about training small model on eventset completely generated by GPT4. Small stories using kids vocabulary. Probably (half) a year back. So it's possible. My idea was to mix in significant portion of generated texts on logic and reasoning. Which would be expensive to create using humans, but much cheaper using GPT. But you are right, process is unstable and will (likely) collapse without some extra efforts. Mixing is one way. That will make it still respond correctly on original data.
replies(1): >>40757547 #
12. advael ◴[] No.40757547{5}[source]
There's a world of difference between machine teaching approaches that can create a less complex model from a more capable one and bootstrapping a more capable model from synthetic data. And don't get me wrong, it's still very useful to be able to distill models in this way! Like it's in many cases low hanging fruit for optimizing the parameter count or other resource bottlenecks of the models in question. Maybe the original learned representation wasn't the simplest neural network that could approximate the same function to the tolerance we care about. This streamlining can sometimes even induce distillation of certain abstractions, which I think has best been used in Motion Transfer results like Nvidia's Megaportraits or more recently Alibaba's EMO. However, if there's a scale-oriented path to language models that generalize better - or are more controllable, or just do better on established benchmarks - that is currently bottlenecked by available data, it seems unlikely that relying on synthetic data from extant models will get it over that hurdle, and this should kind of match your intuition if you're familiar with the information theory underlying statistical models, which neural networks of any kind are:

A model's predictions are necessarily going to be a compression of the data available to them, and so the hypothetical information-theoretic best case scenario is that a model trained on its own outputs, or even those of models trained in a similar way on similar volumes of data will generate diverse enough data to train a new model to replicate its own performance. In practice, this tends not to happen. Curation of available data can produce models with more focused distributions within the space of models we can feasibly train with the data and resources available, and you can use ensemble learning techniques or I guess stuff like RLHF (Which is kind of a silly framing of that concept as some RL people have pointed out, but it's the one people are familiar with now), but all of this is essentially just moving around in a pareto front which may not contain any "strictly better" model for whatever criteria we care about

I think the scaling laws of these things are running up against some fundamental limits in terms of useful diversity of available data and computational feasibility of meaningful improvements in scale. While hype likes to pretend that anything that happens fast for a while is "exponential", there are lots of other families of functions that appear to shoot upward before plateauing after hitting some fundamental limit, like a sigmoid! To me, it makes more intuitive sense that the capacity of a given model family will hit a plateau than continue scaling indefinitely, especially when we start to run up against dataset limits, and if there's enough more data than the current major tech companies can have already gotten their hands on to train with to a degree that makes a dent, I'd be shocked

That's not to say that impressive results aren't still happening, they're just mostly tackling different problems - various modality transfers, distillation-like improvements that make extant capability sets cheaper (in computational terms) to run, superficial capability shifts that better refine a language model to serve a particular use case, etc. LLMs in their current form are probably in need of another significant qualitative breakthrough to overcome their fundamental problems. They're clearly quite useful to a lot of people in their current form. They just don't live up to all this hype that's flying around.