Most active commenters
  • lelanthran(5)

←back to thread

Getting 50% (SoTA) on Arc-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points tomduncalf | 11 comments | | HN request time: 0.001s | source | bottom
Show context
mikeknoop ◴[] No.40712282[source]
(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop and using 4o to sample reasoning traces/programs from training data and test. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.

A couple important notes:

1. this result is on the public eval set vs private set (ARC Prize $).

2. the current private set SOTA ~35% solution also performed ~50% on the public set. so this new result might be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

EDIT: also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. we hope to inspire more frontier AI research sharing like this

replies(11): >>40712673 #>>40712907 #>>40713440 #>>40714116 #>>40714245 #>>40714428 #>>40715353 #>>40715468 #>>40715482 #>>40716604 #>>40718028 #
1. lelanthran ◴[] No.40715468[source]
Maybe I am missing something, but to me this looks like "Let's brute-force on the training data".

I mean, generating tens of thousands of possible solutions, to find one that works does not, to me, signify AGI.

After all, the human solving these problem doesn't make 10k attempts before getting a solution, do they?

The approach here, due to brute force, can't really scale: if a random solution to a very simple problem has a 1/10k chance of being right, you can't scale this up to non-trivial problems without exponentially increasing the computational power used. Hence, I feel this is brute-force.

replies(1): >>40716577 #
2. killerstorm ◴[] No.40716577[source]
10000 samples are nothing compared to 2^100 possible outputs. It is absolutely, definitely not a "brute search". Testing a small fraction of possibilities (e.g. 0.000001%) is called heuristics, and that's what people use too.

Please learn a bit of combinatorics.

> After all, the human solving these problem doesn't make 10k attempts before getting a solution, do they?

No. People have much better "early rejection", also human brain has massive parallel compute capacity.

It's ridiculous to demand GPT-4 performs as good as a human. Obviously its vision is much worse and it doesn't have 'video' and physics priors people have, so it has to guess more times.

replies(1): >>40716765 #
3. lelanthran ◴[] No.40716765[source]
> 10000 samples are nothing compared to 2^100 possible outputs. It is absolutely, definitely not a "brute search". Testing a small fraction of possibilities (e.g. 0.000001%) is called heuristics, and that's what people use too.

Brute searching literally means generating solutions until one works. Which is exactly what is being done here.

> Please learn a bit of combinatorics.

Don't be condescending - I understand the problem space just fine. Fine enough to realise that the problem was constructed specifically to ensure that "solutions" such as this just won't work.

Which is why this "solution" is straight-up broken (doesn't meet the target, exceeds the computationally bounds, etc).

> It's ridiculous to demand GPT-4 performs as good as a human.

Wasn't the whole point of this prize to spur interest in a new approach to learning? What does GPT-[1234] have to do with the contest rules? Especially since this solution broke those rules anyway?

> Obviously its vision is much worse and it doesn't have 'video' and physics priors people have, so it has to guess more times.

That's precisely my point - it has to guess. Humans aren't guessing for those types of problems (not for the few that I saw anyway).

replies(2): >>40717295 #>>40718880 #
4. ealexhudson ◴[] No.40717295{3}[source]
I think to be clear, brute force generally means an iterative search of a solution space. I don't think that's what this system is doing, and it's not like it's following some search path and returning as early as possible.

It's similar that a lot of wrong answers are being thrown up, but I think this is more like a probabilistic system which is being pruned than a walk of the solution space. It's much smarter, but not as smart as we would like.

replies(1): >>40717802 #
5. lelanthran ◴[] No.40717802{4}[source]
> I think to be clear, brute force generally means an iterative search of a solution space.

Sure, but not an exhaustive one - you stop when you get a solution[1]. Brute force does not require an exhaustive search in order to be called brute-force.

GP was using the argument that because it is not exhaustive, it cannot be brute-force. That's the wrong argument. Brute-force doesn't have to be exhaustive to be brute-force.

[1] Or a good enough solution.

replies(1): >>40717996 #
6. naasking ◴[] No.40717996{5}[source]
A brute force search can be expected to find a solution after a more thorough search of the space of possibilities. If it really is only searching 0.000001% of that space before finding solutions, then some structure of the problem is guiding the search and it's no longer brute force.
7. lelanthran ◴[] No.40719551{4}[source]
> I studied algorithms for years.

Who hasn't?

> You're 100% WRONG on everything you wrote.

Maybe you should update the wikipedia page, then all the other textbooks, that uses a definition of brute-force that matches my understanding of it.

From https://en.wikipedia.org/wiki/Brute-force_search

> Therefore, brute-force search is typically used when the problem size is limited, or when there are problem-specific heuristics that can be used to reduce the set of candidate solutions to a manageable size.

Further, in the same page https://en.wikipedia.org/wiki/Brute-force_search#Speeding_up...

> One way to speed up a brute-force algorithm is to reduce the search space, that is, the set of candidate solutions, by using heuristics specific to the problem class.

I mean, the approach under discussion is literally exactly this.

Now, Mr "ACM ICPC, studied algorithms for years", where's your reference that reducing the solution space using heuristics results in a non-brute-force algorithm?

replies(2): >>40720086 #>>40721550 #
8. baobabKoodaa ◴[] No.40720086{5}[source]
This was extremely cringe worthy to read. You are confidently wrong about this. You are trying to redefine well established terminology. I don't care about whatever random wiki page you might find to "support your claims". Anybody who has worked a lot on algorithms (including myself) knows what brute force means and this is not it.

Also: lol at your "who hasn't" comment. Because you clearly haven't.

replies(1): >>40720217 #
9. lelanthran ◴[] No.40720217{6}[source]
> You are trying to redefine well established terminology.

Reference? Link, even?

> don't care about whatever random wiki page you might find to "support your claims".

That isn't some "random wiki" page; that's the wikipedia page for this specific term.

I'm not claiming to have defined this term, I'm literally saying I only agree with the sources for this term.

> Also: lol at your "who hasn't" comment. Because you clearly haven't.

Talk about cringe-worthy.

replies(1): >>40720602 #
10. baobabKoodaa ◴[] No.40720602{7}[source]
> Reference? Link, even?

Sure, here's definition for "brute force" from university textbook material written by pllk, who has taught algorithms for 20 years and holds a 2400 rating on Codeforces:

https://tira.mooc.fi/kevat-2024/osa9/

"Yleispätevä tapa ratkaista hakuongelmia on toteuttaa raakaan voimaan (brute force) perustuva haku, joka käy läpi kaikki ratkaisut yksi kerrallaan."

edit:

Here's an English language book written by the same author, though the English source does not precisely define the term:

https://cses.fi/book/book.pdf

In chapter 5:

"Complete search is a general method that can be used to solve almost any algorithm problem. The idea is to generate all possible solutions to the problem using brute force ..."

And a bit further down chapter 5:

"We can often optimize backtracking by pruning the search tree. The idea is to add ”intelligence” to the algorithm so that it will notice as soon as possible if a partial solution cannot be extended to a complete solution. Such optimizations can have a tremendous effect on the efficiency of the search."

Your mistake is that you for some reason believe that any search over solution space is a brute force solution. But there are many ways to search over a solution space. A "dumb search" over solution space is generally considered to be brute force, whereas a "smart search" is generally not considered to be brute force.

Here's the Codeforces profile of the author: https://codeforces.com/profile/pllk

edit 2:

Ok now I think I understand what causes your confusion. When an author writes "One way to speed up a brute-force algorithm ..." you think that the algorithm can still be called "brute force" after whatever optimizations were applied. No. That's not what that text means. This is like saying "One way to make a gray car more colorful is by painting it red". Is it still a gray car after it has been painted red? No it is not.

11. killerstorm ◴[] No.40721550{5}[source]
You're asking for a definition of exhaustive search. Exhaustive search, by definition, goes through the entire search space. That's what word exhaustive means.

For a reference, check Cormen's "Introduction to Algorithms". Every mention of brute-force search is specifically to exhaustive search is which not feasible for bigger spaces.

> I mean, the approach under discussion is literally exactly this.

It's literally not. It DOES NOT REDUCE the candidate set. It generates most likely candidates, but it doesn't reduce anything.

You lack basic understanding. Solutions are pixels grids, not Python programs. There's no search over pixel grids in the article. Not every search is exhaustive search.

This is like saying theoretical physicists are "brute-forcing" physics by generating candidate theories and testing them. Ridiculous.