Getting 50% (SoTA) on Arc-AGI with GPT-4o

mikeknoop ◴[] No.40712282[source]
(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop and is using 4o to sample reasoning traces/programs from training data and test. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
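For readers who want the sample-then-filter loop spelled out, here is a minimal sketch. It is not Ryan's code: `sample_candidate_programs` is a hypothetical stand-in that returns a fixed list instead of calling GPT-4o, and the grids, the `transform` naming convention and the helper names are all made up for illustration.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)


def sample_candidate_programs() -> List[str]:
    """Hypothetical stand-in for the LLM sampling step.

    In the approach quoted above this would be thousands of GPT-4o
    completions; here it is a fixed list so the sketch runs offline.
    """
    return [
        "def transform(grid):\n    return grid",                         # identity
        "def transform(grid):\n    return [row[::-1] for row in grid]",  # mirror left-right
        "def transform(grid):\n    return grid[::-1]",                   # flip top-bottom
        "def transform(grid):\n    return grid[0][5]",                   # buggy: raises on small grids
    ]


def compile_candidate(source: str) -> Optional[Callable[[Grid], Grid]]:
    """Turn program text into a callable, or None if it fails to define transform."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # untrusted code would need a sandbox in practice
    except Exception:
        return None
    fn = namespace.get("transform")
    return fn if callable(fn) else None


def passes_all(fn: Callable[[Grid], Grid], examples: List[Example]) -> bool:
    """True only if the program reproduces every example output without crashing."""
    for grid_in, grid_out in examples:
        try:
            if fn(grid_in) != grid_out:
                return False
        except Exception:
            return False
    return True


def select_program(sources: List[str], examples: List[Example]) -> Optional[Callable[[Grid], Grid]]:
    """Return the first candidate that is right on all the examples."""
    for source in sources:
        fn = compile_candidate(source)
        if fn is not None and passes_all(fn, examples):
            return fn
    return None


# Three made-up examples whose hidden rule is "mirror each row".
examples: List[Example] = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 4, 5]], [[5, 4, 3]]),
    ([[0, 7], [7, 0]], [[7, 0], [0, 7]]),
]
test_input: Grid = [[1, 2, 3], [4, 5, 6]]

program = select_program(sample_candidate_programs(), examples)
prediction = program(test_input) if program else None
print(prediction)
```

The examples act as unit tests for the sampled programs, so the selection step needs no model at all. How the real system picks among several passing programs is not covered here.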

A couple of important notes:

1. this result is on the public eval set vs private set (ARC Prize $).

2. the current private set SOTA ~35% solution also performed ~50% on the public set. so this new result might be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

EDIT: also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. we hope to inspire more frontier AI research sharing like this

refreshingdrink ◴[] No.40714116[source]
Also worth noting that Ryan mentions

> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set


> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing

It’s not unfortunate: generalizing beyond the training distribution is a crucial part of intelligence that ARC is trying to measure! Among other reasons, developing with test-set data is a bad practice in ML because it hides the difficulty of this challenge. Even worse, writing about a bunch of tricks that help results on this subset is extending the test-set leakage to the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set.

rfoo ◴[] No.40715655[source]
... and we know that if we really want to nail it we'd better just pay someone else to create 1,000,000 more, harder problems for training (without looking at any in the test set, of course), i.e. make the training set distribution similar to the test set again.

Because the thing we have now is data-hungry. Your brain is pre-trained on other similar challenges as well. What's the point of requiring it to "generalize beyond the training distribution" with so few samples?

Really, I thought LLMs ended this "can we pretrain on in-house prepared private data for ILSVRC" flame war already.

YeGoblynQueenne ◴[] No.40715788[source]
The problem with that is that we know approaches that can generalise very well from very few examples, even one example, without any kind of pretraining. That requires a good background theory of the target domain (a "world model" in more modern parlance), and we don't know how to automatically generate that kind of theory; only human minds can do it, for now. But given such a theory the number of examples needed can be as few as 1. Clearly, if you can learn from one example, but find yourself using thousands, you've taken a wrong turn somewhere.

The concern with the data-hungry approach to machine learning, that at least some of us have, is that it has given up on the effort to figure out how to learn good background theories and turned instead to getting the best performance possible in the dumbest possible way, relying on the largest available amount of examples and compute. That's a trend against everything else in computer science (and even animal intelligence) where the effort is to make everything smaller, cheaper, faster, smarter: it's putting all the eggs in the basket of making it big, slow and dumb, and hoping that this will somehow solve... intelligence. A very obvious contradiction.

Suppose we lived in a world that didn't have a theory of computational complexity and didn't know that some programs are more expensive to run than others. Would it be the case in that world, that computer scientists competed in solving ever larger instances of the Traveling Salesperson Problem, using ever larger computers, without even trying to find good heuristics exploiting the structure of the problem and simply trying to out-brute-force each other? That world would look a lot like where we are now with statistical machine learning: a pell-mell approach to throwing all resources at a problem that we just don't know how to solve, and don't even know if we can solve.
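To make the thought experiment concrete, here is a minimal sketch of the contrast between brute force and a structure-exploiting heuristic on a tiny Traveling Salesperson instance. The coordinates and function names are made up for illustration, and nearest-neighbour is just one simple heuristic chosen as an example; it is not guaranteed to find the optimal tour in general.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def tour_length(points: Sequence[Point], order: Sequence[int]) -> float:
    """Length of the closed tour that visits the points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def brute_force_tour(points: Sequence[Point]) -> Tuple[List[int], int]:
    """Exact answer by trying every ordering: (n - 1)! candidate tours."""
    best_order: List[int] = []
    best_len = math.inf
    checked = 0
    for perm in itertools.permutations(range(1, len(points))):
        order = [0, *perm]  # fix city 0 as the start to skip rotations
        checked += 1
        length = tour_length(points, order)
        if length < best_len:
            best_len, best_order = length, order
    return best_order, checked


def nearest_neighbour_tour(points: Sequence[Point]) -> List[int]:
    """Greedy heuristic: always move to the closest unvisited city (O(n^2))."""
    unvisited = set(range(1, len(points)))
    order = [0]
    while unvisited:
        last = points[order[-1]]
        # Ties are broken by index so the result is deterministic.
        nxt = min(unvisited, key=lambda i: (math.dist(last, points[i]), i))
        unvisited.remove(nxt)
        order.append(nxt)
    return order


# Eight made-up cities on a 4 x 2 grid.
cities: List[Point] = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (2, 1), (1, 1), (0, 1)]

exact_order, tours_checked = brute_force_tour(cities)
greedy_order = nearest_neighbour_tour(cities)

print("brute force tours checked:", tours_checked)
print("exact length:", tour_length(cities, exact_order))
print("greedy length:", tour_length(cities, greedy_order))
```

With eight cities the exhaustive search already checks 5040 tours, and that count grows factorially with each added city, while the greedy pass stays quadratic.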

advael ◴[] No.40715903[source]
The formalism that data-driven machine learning leans on is empirical tuning of stochastic search to drive approximation of functions, and despite what Silicon Valley would have you believe, most of the significant advances have been in creating useful meta-structures for modeling certain kinds of problems (e.g. convolution for efficiently processing transformations that care about local structure across dimensions of data, or qkv attention for keeping throughlines of non-local correspondences intact through a long sequence). Neural networks as a flavor of empirical function approximation happened to scale well, and then a bunch of people who saw how much this scale improved the models' capabilities but couldn't be bothered to understand the structural component concluded that scale somehow magically gets you to every unsolved problem being solved. It's also convenient for business types that if you buy this premise, any unicorn they want to promise is just a matter of throwing obscene amounts of resources at the problem (through their company of course)

I think probably the general idea of dynamic structures that are versatile in their ability to approximate functional models is at least a solid hypothesis for how some biological intelligence works at some level (I think maybe the "fluid/crystallized" intelligence distinction some psychology uses is informative here - a strong world model probably informs a lot of quick acquisition of relationships, but most intelligent systems clearly possess strong feedback mechanisms for capturing new models), though I definitely agree that a focus on how best to throw a ton of scale at these models doesn't seem like a fruitful path for actionably learning how to build or analyze intelligent systems in the way we usually think about, nor is it, well, sustainable. Moore's law appeals to business people because buying more computronium feels more like a predictable input-output relationship to put capital into, but even if we're just talking about raw computation speed, advances in algorithms tend to dwarf advances in computing power in the long run. I think the same will hold true in AGI.

1. YeGoblynQueenne ◴[] No.40720145[source]
Yeah, very good points. To be fair there are people who have argued the big data side who have clearly solid knowledge of AI and are not just SV suits, for example I remember Yann LeCun in a debate with Christopher Manning, where Manning was arguing for the importance of "structure" and LeCun was arguing against it. Or see the "Bitter Lesson", mentioned in a parent comment. That may have become a total shibboleth of the Silicon bros but Rich Sutton, who wrote the eponymous article, is the guy who wrote the book on Reinforcement Learning (literally). And then Rodney Brooks replied with his "Better Lesson" (https://rodneybrooks.com/a-better-lesson/). So there's a lot of debate in this and I don't reckon we'll have a consensus soon. It should be clear which side I'm on - I work with firmly model-based AI ("planning is the model-based approach to autonomous behaviour" has become my shibboleth - see Bonet and Geffner's book on planning: https://link.springer.com/book/10.1007/978-3-031-01564-9) so maybe it's a déformation professionnelle. And even LeCun's recent plans for JEPA are very consciously model-based, except he wants to learn his models from data; which is not a bad idea I suppose.

2. advael ◴[] No.40720643[source]
I've commented here before that I find myself really conflicted on LeCun's public statements. I think it's really hard to reconcile the fact that he's undeniably a world-leading expert with the fact that he does work for and represent a tech company in a big way, which means that it's hard to tell when what he says, especially publicly, is filtered through that lens, either explicitly or just via cultural osmosis. I know some people still in academia (e.g. "Bitter Lesson") are following suit but given how much of this field has been scooped up by large tech firms, this necessarily means that what we get out of research from those firms is partially filtered through them. Like it sounds like you're in CS/AI academia so I'm sure you're familiar with the distorting effect this brain drain has had on the field. Research out of places like FAIR or DeepMind or OpenAI (arguably they were different until about 2019 or so? Hard to say how much of that was ever true unfortunately) is being done and published by world-leading experts hired by these companies and obviously this research has continued to be crucial to the field, but the fact that it's in industry means there are obviously controls on what they can publish, and the culture of an institution like Facebook is definitely going to have some different effects on priorities than that of most universities, and so while we can all collectively try to take it all with a grain of salt in some way, there is no way to be careful enough to avoid tribal knowledge in the field being heavily influenced by the cultures and priorities of these organizations.

But even if this kind of thinking is totally organic, I think it could arise from the delayed nature of the results of data-driven methods. Often a major structural breakthrough for a data-driven approach drastically predates the most obviously impactful results from that breakthrough, because the result impressive enough to draw people's attention comes from throwing lots of data and compute at the breakthrough. The people who got the impressive result might not even be the same team as the one that invented the structure they're relying on, and it's really easy to get the impression that what changed the game was the scale alone, I imagine even if you're on one of those research teams. I've been really impressed by some of the lines of research that show that you can often distill some of these results to not rely so heavily on massive datasets and enormous parallel training runs, and think we should properly view results that come from these to be demonstrations of the power of the underlying structural insights rather than new results. But I think this clashes with the organizational priorities of large tech firms, which often view scale as a moat, and thus are motivated to emphasize the need for it

3. barfbagginus ◴[] No.40721179[source]
The recent result shows SOTA progress from something as goofy as generating 5000 Python programs until 0.06% of them pass the unit tests. We can imagine our own brains having a thousand random subconscious pre-thoughts before our consciously registered thought is chosen and amplified out of the hallucinatory subconscious noise. We're still at a point where we're making surprising progress from simple feedback loops, external tools and checkers, retries, backtracking, and other bells and whistles added to the LLM. Some of these even look like world models.
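A quick back-of-the-envelope check of those figures, as a sketch only. It takes the 5000 programs and 0.06% pass rate from the comment at face value and assumes every sample passes independently with the same probability, which real LLM samples for a single task will not do exactly. The function names are made up for illustration.

```python
def expected_passes(n_samples: int, pass_rate: float) -> float:
    """Expected number of sampled programs that pass all the examples."""
    return n_samples * pass_rate


def prob_at_least_one(n_samples: int, pass_rate: float) -> float:
    """Chance that at least one sample passes, assuming independent samples."""
    return 1.0 - (1.0 - pass_rate) ** n_samples


n, p = 5000, 0.0006  # 5000 programs, 0.06% pass rate (figures from the comment)

print(expected_passes(n, p))        # about 3 passing programs
print(prob_at_least_one(n, p))      # roughly 0.95
print(prob_at_least_one(100, p))    # roughly 0.06 with only 100 samples
```

Under these assumptions a very low per-sample hit rate still gives a good chance of at least one passing program, but only because the sample count is large.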

So maybe we can cure LLMs of the hallucinatory leprosy just by bathing them about 333 times in the mundane Jordan river of incremental bolt-ons and modifications to formulas.

You should be able to think of the LLM as a random hallucination generator then ask yourself "how do I wire ten thousand random hallucination generators together into a brain?" It's almost certain that there's an answer... And it's almost certain that the answer is even going to be very simple in hindsight. Why? Because LLMs are already more versatile than the most basic components of the brain and we have not yet integrated them at the scale that components are integrated in the brain.

It's very likely that this is what our brains do at the component level - we run a bunch of feedback-coupled hallucination generators that, when we're healthy, generate a balanced and generalizing consciousness - a persistent, reality-coupled hallucinatory experience that we sense and interpret and work within as the world model. That just emerges from a network of self-correcting natural hallucinators. For evidence, consider work on cortical columns and the Thousand Brains theory. This suggests our brains have about a million cortical columns. Each loads up random inaccurate models of the world... And when we do integration and error correction over that, we get a high-level conscious overlay. Sounds like what the author of the currently discussed SOTA did, but with far more sophistication. If the simplest, most obvious approach to jamming 5,000 LLMs together into a brain gives us some mileage, then it's likely that a more reasoned and intelligent approach could get these things doing feats like the fundamentally error-prone components of our own brains can do when working together.

So I see absolutely no reason we couldn't build an analogy of that with LLMs as the base hallucinator. They are versatile and accurate enough. We could also use online-training LLMs and working memory buffers as the base components of a JEPA model.

It's pretty easy to imagine that a society of 5000 GPT-4 hallucinators could, with the right self-administered balances and utilities, find the right answers. That's what the author did to win the 50%.

Therefore I propose that for the current generation it's okay to just mash a bunch of hallucinators together and whip them into the truth. We should be able to do it because our brains have to be able to do it. And if you're really smart, you will find a very efficient mathematical decomposition... Or a totally new model. But for every current LLM inability, it's likely to turn out that a sequence of simple modifications can solve it. We will probably accrue a large number of such modifications before someone comes along and thinks of an all-new model then does way better, perhaps taking inspiration from the proposed solutions, or perhaps exploring the negative space around those solutions.

4. YeGoblynQueenne ◴[] No.40723513[source]
Absolutely, industry and its neverending piggy bank have had a severe distorting effect on the direction of research. I'm a post-doc btw, right now working on robotic autonomy. I don't have direct experience of the brain drain- I'm in a UK university- but I can see the obvious results in the published research which has very suddenly lurched towards LLMs recently, as it did a very sudden lurch towards CNNs after 2012 etc.

Like you say, large tech corps clearly see big data approaches as a moat, as a game that they can play better than anyone else: they got the data, they got the compute, and they got the millions to hoover up all the "talent". Obviously, when it's corporations driving research they are not going to drive it towards a deepening of understanding and an enriching of knowledge, the only thing they care about is selling stuff to make money, and to hell with whether that stuff works or not and why. I'm worried even that this is going to have a degrading effect on the output of science and technology in general, not just AI and CS. It's like a substantial minority of many fields of science have given up on basic research and are instead feeding data to big neural nets and poking LLMs to see what will fall out. This is a very bad situation. Not a winter but an Eternal Summer.

Take it away, Tom.


5. YeGoblynQueenne ◴[] No.40723557[source]
Thanks for the comment but I have to say: whoa there, hold your horses. Hallucinations as the basis of intelligence? Why?

Think about it this way: ten years ago, would you think that hallucinations have anything to do with intelligence? If it were 2012, would you think that convolutions, or ReLUs, are the basis of intelligence instead?

I'm saying there is a clear tendency within AI research, and without, to assume that whatever big new idea is currently trending is "it" and that's how we solve AI. Every generation of AI researchers since the 1940's has fallen down that pit. In fact, no lesser men than Walter Pitts and Warren McCulloch, the inventors of the artificial neuron in 1943, firmly believed that the basis of intelligence is propositional logic. That's right. Propositional logic. That was the hot stuff at the time. Besides, the first artificial neuron was a propositional logic circuit that learned its own boolean function.
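To illustrate the propositional-logic reading of that early neuron, here is a minimal sketch of a threshold unit wired up as logic gates. It is a simplification under stated assumptions: the weights and thresholds are set by hand (nothing is learned in this sketch), and a negative weight stands in for the inhibitory input of the 1943 model. All function names are made up for illustration.

```python
from itertools import product
from typing import Sequence


def threshold_unit(inputs: Sequence[int], weights: Sequence[int], threshold: int) -> int:
    """Fire (1) when the weighted sum of binary inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def AND(a: int, b: int) -> int:
    return threshold_unit([a, b], [1, 1], 2)  # both inputs must be on


def OR(a: int, b: int) -> int:
    return threshold_unit([a, b], [1, 1], 1)  # one input is enough


def NOT(a: int) -> int:
    return threshold_unit([a], [-1], 0)  # negative weight models inhibition


def IMPLIES(a: int, b: int) -> int:
    """a -> b, built by composing units: (not a) or b."""
    return OR(NOT(a), b)


# Print the truth table: a, b, a AND b, a OR b, a -> b
for a, b in product([0, 1], repeat=2):
    print(a, b, AND(a, b), OR(a, b), IMPLIES(a, b))
```

Composing a few such units is enough to express any propositional formula, which is the sense in which a network of them behaves like a logic circuit.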

So keep an eye out for being carried away on the wings of the latest hype and thinking we got the solution to every problem just because we can do yet another thing, with computers, that we couldn't do before.

6. advael ◴[] No.40729684{3}[source]
Hot girl summer is cancelled we got hot GPU trying to bear the weight of humanity's hopes and dreams as they collapse into a single point summer

Hot market forces treated as inevitable as the ever-rising tides summer

Hot war with nuclear powers looming as a possibility on the world stage even as one such power's favored information warfare strategy of flooding all communication channels with noise becomes ever more indistinguishable from those channels' normal state summer

In a mad world, heavy metal oscillates between states of catharsis and prophecy

Anyway I really appreciate your taking the time to respond thoughtfully and am trying to channel your patient approach in my endeavors today. Hope your summer's going well, despite the looming threat of its eternity