Getting 50% (SoTA) on ARC-AGI with GPT-4o

1. extr ◴[17 Jun 24 22:42 UTC] No.40712008[source]▶

Very cool. When GPT-4 first came out I tried some very naive approaches using JSON representations of the puzzles [0], [1]. GPT-4 did "okay", but in some cases it felt like it was falling for the classic LLM issue of saying all the right things but then failing to grasp some critical bit of logic and missing the solution entirely.

At the time I noticed that many of the ARC problems rely on visual-spatial priors that are "obvious" when viewing the grids, but become less so when transmuted to some other representation. Many of them rely on some kind of symmetry, counting, or the very human bias to assume a velocity or continued movement when seeing particular patterns.
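To make the representation point concrete, here is a minimal sketch. The 3x3 grid is made up for illustration (it is not an ARC task), and the helper names are hypothetical. It shows the same grid as a flat JSON string and as rows, plus the kind of symmetry check that is visually obvious but has to be computed explicitly from the serialized form.

```python
import json

# A made-up 3x3 grid (not from the ARC dataset); integers stand for colours.
grid = [
    [0, 1, 0],
    [1, 2, 1],
    [0, 1, 0],
]

def as_json(g):
    """Flatten the grid into the kind of string a text-only model receives."""
    return json.dumps(g)

def as_rows(g):
    """Render one row per line, which keeps the 2D layout visible."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in g)

def is_left_right_symmetric(g):
    """True when every row reads the same forwards and backwards."""
    return all(row == row[::-1] for row in g)

def is_top_bottom_symmetric(g):
    """True when the rows read the same top-down and bottom-up."""
    return g == g[::-1]

print(as_json(grid))
print(as_rows(grid))
print(is_left_right_symmetric(grid), is_top_bottom_symmetric(grid))
```

In the row rendering the symmetry is easy to spot by eye; in the JSON string it is the same information, but nothing about the layout suggests it.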

I had always thought maybe multimodality was key: the model needs to have similar priors around grounded physical spaces and movement to be able to do well. I'm not sure the OP really fleshes this line of thinking out; brute forcing python solutions is a very "non human" approach.

[0] https://x.com/eatpraydiehard/status/1632671307254099968

[1] https://x.com/eatpraydiehard/status/1632683214329479169

replies(2): >>40712644 #>>40716335 #

2. refulgentis ◴[18 Jun 24 00:00 UTC] No.40712644[source]▶

>>40712008 (TP) #

> brute forcing python solutions is a very "non human" approach.

ARC-AGI has odd features that leave me flummoxed by the naming and the attendant prize money and hype.

It is one singular task and frankly I strongly suspect someone could beat it within 30 days[1], in an unsatisfying way, as you note.

There's so much alpha that can be pieced together from here, e.g. the last couple of Google papers use the 1M context to do *500-shot*, i.e. 500 question-answer examples. IIRC the most recent one showed the travelling-salesman problem solve rate rising from 3% to 35%.

[1] I pre-registered this via a Twitter post, about 48 hours ago, i.e. before this result was announced.

replies(2): >>40712736 #>>40714197 #

3. elicksaur ◴[18 Jun 24 00:18 UTC] No.40712736[source]▶

>>40712644 #

The private test set has been available to crack for almost four years now. There was also a monetary prize competition run last year.

In your opinion, what has changed that would accelerate a solution to the next 30 days?

replies(1): >>40713193 #

4. refulgentis ◴[18 Jun 24 01:33 UTC] No.40713193{3}[source]▶

>>40712736 #

Prize money meant people would more cleverly strain the rule that "the private test set stays private, no GPT4o, Claude etc.", as shown by the TFA.

This sort of idea would then be shared openly on news sites, creating more attempts. Fallout I did not anticipate was getting widespread attention on general tech news sites, and then getting public comment from a prize co-founder confirming it was acceptable.

replies(1): >>40713751 #

5. elicksaur ◴[18 Jun 24 03:14 UTC] No.40713751{4}[source]▶

>>40713193 #

It seems like you don’t understand the rules of the competition. Entries don’t have access to the internet. The OP acknowledges in their post that this is not eligible for the prize. The HN comment from the prize co-founder specifically says the OP’s claims haven’t been scrutinized. (implicit: they won’t be for the prize set unless the OP submits with an open LLM implementation)

There is a plan for a “public” leaderboard, but it currently has no entries, so we don’t actually know what the SOTA for the unrestrained version is. [1]

The general idea - test time augmentation - is what the current private set SOTA uses. [2] Generating more examples via transforming the samples is not a new idea.
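As a rough illustration of "generating more examples via transforming the samples", the sketch below applies the same geometric transform to both sides of an (input, output) pair. The function names are hypothetical, and this is not the code used by any leaderboard entry. It also assumes the task's rule survives rotation and flipping, which is not true of every task.

```python
def rotate90(g):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def flip_lr(g):
    """Mirror a grid left to right."""
    return [row[::-1] for row in g]

def augment(pair):
    """Return the original (input, output) pair plus transformed copies.

    The same transform is applied to both sides, so the copies are only
    valid examples if the underlying rule is unaffected by that transform.
    """
    inp, out = pair
    variants = [(inp, out)]
    for transform in (rotate90, flip_lr):
        variants.append((transform(inp), transform(out)))
    return variants

# Made-up example pair: the "rule" here is swapping the two colours.
pair = ([[1, 2], [1, 2]], [[2, 1], [2, 1]])
for inp, out in augment(pair):
    print(inp, "->", out)
```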

Really, it seems like all the publicity has just gotten a bunch of armchair software architects coming up with 1-4 year-old ideas thinking they are geniuses.

[1] https://arcprize.org/leaderboard

[2] https://lab42.global/community-interview-jack-cole/

replies(2): >>40714174 #>>40716472 #

6. refulgentis ◴[18 Jun 24 04:52 UTC] No.40714174{5}[source]▶

>>40713751 #

> It seems like you don’t understand the rules of the competition.

I don't think you "don't understand" anything :) I'd ask you, politely, to consider that when you're replying to other people in the future.

Better to bring to interactions the prior that your interlocutor is a presumably intelligent individual who can have a different interpretation of the same facts, than decide they just don't get it. The second is a quite lonely path.

> Entries don’t have access to the internet.

Correct. Per TFA, cofounder, Chollet, then me: this is an offline solution: the solution is the Python program found by an LLM.

> The HN comment from the prize co-founder specifically says the OP’s claims haven’t been scrutinized.

Objection: relevancy? Is your claim here that it might be false so we shouldn't be discussing it at all?

> (implicit: they won’t be for the prize set unless the OP submits with an open LLM implementation)

I don't know what this means, "open LLM implementation" is either a term of art I don't recognize, or a misunderstanding of the situation.

I do assume you read the article, so I'm not trying to talk down to you, but to clarify:

The solution is the Python program, not the LLM prompts that iterated on a Python program. A common thread that would describe the confusing experience of reading your comment phrased aggressively and disputing everything up until you agree with me: your observations assume I assume the solution requires a cloud-based LLM to run. As noted above, it doesn't, which is also the thrust of my comment: they found a way to skirt what I thought the rules are, and the co-founder and Chollet have embraced it, publicly.

> There is a plan for a “public” leaderboard, but it currently has no entries, so we don’t actually know what the SOTA for the unrestrained version is. [1]

This was false before you posted, when I checked this morning, and it was false as early as 4 days ago, June 14th, we can confirm via archive.is. (prefix the URL you provided with archive.is/ to check for yourself)

> The general idea - test time augmentation - is what the current private set SOTA uses. [2] Generating more examples via transforming the samples is not a new idea.

Did anyone claim it was?

> Really, it seems like all the publicity has just gotten a bunch of armchair software architects coming up with 1-4 year-old ideas thinking they are geniuses.

I don't know what this means other than you're upset, but yes, sounds like both you and I agree that having an LLM generate Python programs isn't quite what we'd thought would be an AGI solution in the eyes of Chollet.

Alas, here we are.

replies(2): >>40714210 #>>40720941 #

7. nl ◴[18 Jun 24 04:58 UTC] No.40714197[source]▶

>>40712644 #

I don't think this is "non-satisfying" at all.

Program synthesis has been mentioned as a promising approach by François Chollet, and that's exactly what this is.

The place I find slightly unsatisfying is this:

> Sample vast, vast numbers of completions (~5,000 per problem) from GPT-4o.

> Take the most promising 12 completions for each problem, and then try to fix each by showing GPT-4o what this program actually outputs on the examples, and then asking GPT-4o to revise the code to make it correct. We sample ~3,000 completions that attempt to fix per problem in total across these 12 starting implementations.

I'd been tossing around a MCTS idea similar to AlphaGo, based on the idea that the end transformation is a series of sub-transformations. I feel like this could work well alongside the GPT-4o completion catalog. (This isn't an original observation or anything)
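For readers who want the shape of the quoted "sample many, keep the most promising" step, here is a minimal sketch. The candidate functions are hand-written stand-ins for sampled completions (no LLM is called), and ranking by number of matching examples is an assumption for illustration, not a description of how the article scores candidates.

```python
def score(program, examples):
    """Count how many (input, output) examples the program reproduces."""
    hits = 0
    for inp, expected in examples:
        try:
            if program(inp) == expected:
                hits += 1
        except Exception:
            pass  # a crashing candidate scores nothing on that example
    return hits

def top_k(candidates, examples, k):
    """Keep the k candidates that match the most examples."""
    ranked = sorted(candidates, key=lambda p: score(p, examples), reverse=True)
    return ranked[:k]

# Hypothetical stand-ins for sampled completions.
def identity(g):
    return g

def flip_lr(g):
    return [row[::-1] for row in g]

def broken(g):
    return g[99]  # raises IndexError on small grids

# Made-up training examples whose rule is a left-right mirror.
examples = [
    ([[1, 2]], [[2, 1]]),
    ([[3, 4], [5, 6]], [[4, 3], [6, 5]]),
]

best = top_k([identity, broken, flip_lr], examples, k=1)[0]
print(best.__name__, score(best, examples))
```

A revision step would take the survivors, show the model what each one actually outputs on the examples, and feed the revised programs through the same scoring again.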

replies(2): >>40714631 #>>40716375 #

8. nl ◴[18 Jun 24 05:03 UTC] No.40714210{6}[source]▶

>>40714174 #

(Not the OP)

>> (implicit: they won’t be for the prize set unless the OP submits with an open LLM implementation)

> The solution is the Python program, not the LLM prompts that iterated on a Python program. A common thread that would describe the confusing experience of reading your comment phrased aggressively and disputing everything up until you agree with me: your observations assume I assume the solution requires a cloud-based LLM to run. As noted above, it doesn't, which is also the thrust of my comment: they found a way to skirt what I thought the rules are, and the co-founder and Chollet have embraced it, publicly.

I think the implication is that solutions that use an LLM via an API won't be eligible (the "no internet" rule).

This seems straightforward to solve: one can use GPT-4 to generate catalogs in advance and a lesser, local LLM with good code abilities to select from them.

I don't see why this skirts any rules you think were implied and I'm puzzled why you think it does.

> sounds like both you and I agree that having an LLM generate Python programs isn't quite what we'd thought would be an AGI solution in the eyes of Chollet.

> Alas, here we are.

Chollet noted that program synthesis was a promising approach, so it's not surprising to me that a program synthesis approach that also uses an LLM is effective.

9. bubblyworld ◴[18 Jun 24 06:25 UTC] No.40714631{3}[source]▶

>>40714197 #

Classic, I've been doing the same, writing an alphazero for the transformation part. What seems _much_ harder is picking a decent set of transformations/concepts to work with, or more generally automating that process. Maybe you're right that LLMs could help there!
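A small sketch of the "end transformation is a series of sub-transformations" idea. It uses plain exhaustive search over a hand-picked set of three primitives, not MCTS or AlphaZero; a learned policy would take the place of the blind enumeration. The primitive set and all function names are assumptions for illustration.

```python
from itertools import product

def flip_lr(g):
    return [row[::-1] for row in g]

def flip_ud(g):
    return g[::-1]

def transpose(g):
    return [list(row) for row in zip(*g)]

# Hand-picked primitives; choosing this set well is the hard part.
PRIMITIVES = {"flip_lr": flip_lr, "flip_ud": flip_ud, "transpose": transpose}

def apply_chain(names, g):
    """Apply the named primitives to the grid in order."""
    for name in names:
        g = PRIMITIVES[name](g)
    return g

def find_chain(examples, max_depth=3):
    """Return a shortest chain of primitive names that fits every example, or None."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            if all(apply_chain(names, inp) == out for inp, out in examples):
                return list(names)
    return None

# Made-up example whose rule is a 90-degree clockwise rotation.
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
chain = find_chain(examples)
print(chain)
```

The number of chains grows as 3^depth even with this tiny set, which is why a prior over which primitive to try next matters.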

replies(1): >>40716357 #

10. YeGoblynQueenne ◴[18 Jun 24 11:11 UTC] No.40716335[source]▶

>>40712008 (TP) #

>> GPT-4 did "okay", but in some cases it felt like it was falling for the classic LLM issue of saying all the right things but then failing to grasp some critical bit of logic and missing the solution entirely.

It still is. It misses the solution so comprehensively that it needs an outer loop to figure out which one is the solution out of 8k programs GPT-4o generates.

replies(1): >>40717373 #

11. luke-stanley ◴[18 Jun 24 11:14 UTC] No.40716357{4}[source]▶

>>40714631 #

Reminds me of NVIDIA Eureka: https://github.com/eureka-research/Eureka

replies(2): >>40716671 #>>40717870 #

12. YeGoblynQueenne ◴[18 Jun 24 11:15 UTC] No.40716375{3}[source]▶

>>40714197 #

>> Program synthesis has been mentioned as a promising approach by François Chollet, and that's exactly what this is.

To be precise, "this" -a bog-standard generate-and-test approach- is the dumbest possible way to do program synthesis. It's like sorting lists with bogosort and a very big computer.

It's exactly like bogosort: generate permutations and test. Except of course the system that generates permutations costs a few millions(?).
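For reference, a minimal bogosort written to make the generate-and-test structure explicit. The "generate" step is a blind shuffle and the "test" step is a sortedness check. The seeded random.Random is only there to make the run repeatable.

```python
import random

def is_sorted(xs):
    """Test step: is every element no larger than its successor?"""
    return all(a <= b for a, b in zip(xs, xs[1:]))

def bogosort(xs, rng):
    """Generate step: shuffle blindly until the test passes."""
    xs = list(xs)
    attempts = 0
    while not is_sorted(xs):
        rng.shuffle(xs)
        attempts += 1
    return xs, attempts

result, attempts = bogosort([3, 1, 2], random.Random(0))
print(result, "after", attempts, "shuffles")
```

Nothing learned from a failed attempt feeds into the next shuffle here, which is the property the analogy turns on.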

replies(1): >>40716650 #

13. YeGoblynQueenne ◴[18 Jun 24 11:27 UTC] No.40716472{5}[source]▶

>>40713751 #

I wish this comment was less confrontational because there's useful information in it and several points I agree with.

replies(1): >>40720803 #

14. bubblyworld ◴[18 Jun 24 11:47 UTC] No.40716650{4}[source]▶

>>40716375 #

Bogosort is driven by 0 heuristics - just shuffle and play. Using an LLM as a high-level prior over your search is very different, and the author had to do a lot of problem-specific tuning to make it work well.

replies(1): >>40720214 #

15. bubblyworld ◴[18 Jun 24 11:51 UTC] No.40716671{5}[source]▶

>>40716357 #

Very nice! Thanks for the link, that's great inspiration.

16. ealexhudson ◴[18 Jun 24 13:05 UTC] No.40717373[source]▶

>>40716335 #

We don't really know what GPT-4 "is". I remember reading a number of relatively well-informed suggestions that there are a number of models inside there, and the API being interacted with is some form of outer-loop around them.

I don't think the location of the outer-loop or the design of it really makes much difference. There is no flock of birds without the individuals, the flock itself doesn't really exist as a tangible thing, but what arises out of the collective adjustments between all these individuals gives rise to a flock. Similarly, we may find groups of LLMs and various outer control loops give rise to an emergent phenomenon much greater than the sum of their parts.

replies(1): >>40717527 #

17. YeGoblynQueenne ◴[18 Jun 24 13:20 UTC] No.40717527{3}[source]▶

>>40717373 #

>> We don't really know what GPT-4 "is".

Yes, we do. It's a language model.

18. nl ◴[18 Jun 24 13:51 UTC] No.40717870{5}[source]▶

>>40716357 #

Great link, thanks.

19. YeGoblynQueenne ◴[18 Jun 24 17:35 UTC] No.40720214{5}[source]▶

>>40716650 #

But he tuned the "test" side of the generate-and-test loop, not the "generate" side. The "generate" side remains a big permutation generator that is also very hard to control. The current highest-ranked system on the private test set of the ARC-AGI (at 34%) is another LLM fine-tuned on manually created examples of ARC tasks, so that would indeed be messing with the generator part of the loop. I'm guessing performance will jump when someone puts the two together.

A heuristic btw, is something completely different than fine tuning, or filtering. Heuristic search is the closest thing we have to an approximation of the kind of goal-driven behaviour we see in animal intelligence.

I think you could argue that gradient optimisation or any kind of optimisation of some kind of objective function is the same (Rich Sutton co-authored a paper titled "Reward is enough"). I'm not sure where I stand with that.

replies(1): >>40725485 #

20. elicksaur ◴[18 Jun 24 18:44 UTC] No.40720803{6}[source]▶

>>40716472 #

Hey! Genuinely, thank you for the feedback!

21. elicksaur ◴[18 Jun 24 18:58 UTC] No.40720941{6}[source]▶

>>40714174 #

From the leaderboard link (and on the archive version):

>ARC-AGI-Pub is a secondary leaderboard (in beta) measuring the public evaluation set. … The public evaluation set imposes no limitations on internet access or compute. At this time, ARC-AGI-Pub is not part of ARC Prize 2024 (eg. no prizes are associated with this leaderboard).

And, all the entries at time of writing and in the archive link say “You?…”. “ARC-AGI 2024 HIGH SCORES” which does have entries is on the private test set.

>I don't think you "don't understand" anything :)

I genuinely don’t understand if we are viewing the same websites.

replies(1): >>40721090 #

22. refulgentis ◴[18 Jun 24 19:11 UTC] No.40721090{7}[source]▶

>>40720941 #

> I genuinely don’t understand if we are viewing the same websites.

We are! I missed the nuance that you're looking for a public leaderboard on the private test set. I do see it now, but I'm still confused as to how that's relevant here.

23. bubblyworld ◴[19 Jun 24 06:37 UTC] No.40725485{6}[source]▶

>>40720214 #

There's a list of things the author did to change the "generate" side in the first two paragraphs of the article.

The heuristic isn't the fine-tuning, it's the actual LLM, which is clearly pruning the set of possibilities massively. That's a reasonably common usage of the word. I agree combining it with some kind of search would be interesting, but still I think you're being overly negative about the results here.

I'm actually busy training an alphazero for the arc problems, which I plan to try and hook up to a language model for reward generation, so we'll see how that fares!

I've read that paper, but thanks for the reference, this comment section is a goldmine.

replies(1): >>40726591 #

24. YeGoblynQueenne ◴[19 Jun 24 09:41 UTC] No.40726591{7}[source]▶

>>40725485 #

>> There's a list of things the author did to change the "generate" side in the first two paragraphs of the article.

I can't see where that is. All I can see the author saying they did is prompting and filtering of returned answers, none of which is going anywhere near the weights of the language model (that's where I'm claiming the "generator" is residing).

>> I'm actually busy training an alphazero for the arc problems, which I plan to try and hook up to a language model for reward generation, so we'll see how that fares!

That sounds exciting. Good luck with your effort!

replies(1): >>40730107 #

25. bubblyworld ◴[19 Jun 24 16:59 UTC] No.40730107{8}[source]▶

>>40726591 #

Yeah, you don't play with the weights in language models, you play with the residual stream by prompting (and occasionally by direct modification if you're being clever). But that does affect the model's generation! (obviously? otherwise there would be no need for a prompt in the first place, and all the recent residual stream modification research wouldn't work).

But I think if we just banned the word "generator" we probably wouldn't disagree on much here.

> Good luck with your effort!

Thanks =)