
Getting 50% (SoTA) on ARC-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points tomduncalf | 4 comments
extr ◴[] No.40712008[source]
Very cool. When GPT-4 first came out I tried some very naive approaches using JSON representations of the puzzles [0], [1]. GPT-4 did "okay", but in some cases it felt like it was falling for the classic LLM issue of saying all the right things but then failing to grasp some critical bit of logic and missing the solution entirely.

At the time I noticed that many of the ARC problems rely on visual-spatial priors that are "obvious" when viewing the grids, but become less so when transmuted to some other representation. Many of them rely on some kind of symmetry, counting, or the very human bias to assume a velocity or continued movement when seeing particular patterns.

I had always thought maybe multimodality was key: the model needs to have similar priors around grounded physical spaces and movement to be able to do well. I'm not sure the OP really fleshes this line of thinking out; brute-forcing Python solutions is a very "non human" approach.

[0] https://x.com/eatpraydiehard/status/1632671307254099968

[1] https://x.com/eatpraydiehard/status/1632683214329479169
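For readers unfamiliar with the approach under discussion, the "brute forcing Python solutions" idea can be sketched as a search over candidate programs, keeping whichever one reproduces all the training pairs. This is only a toy illustration; the hardcoded candidate list stands in for the thousands of programs an LLM would actually sample:

```python
# Sketch of search-over-programs: keep any candidate transform that
# reproduces every training pair, then apply it to the test input.
# The candidate list is a stand-in for LLM-sampled programs.

def flip_rows(grid):
    return grid[::-1]

def transpose(grid):
    return [list(row) for row in zip(*grid)]

CANDIDATES = [flip_rows, transpose]  # hypothetical sampled programs

def solve(train_pairs, test_input):
    for program in CANDIDATES:
        if all(program(x) == y for x, y in train_pairs):
            return program(test_input)
    return None  # no sampled program fit the examples

train = [([[1, 2], [3, 4]], [[3, 4], [1, 2]])]  # rule: flip rows
print(solve(train, [[5, 6], [7, 8]]))  # -> [[7, 8], [5, 6]]
```

The real pipeline samples far more candidates and iterates on near-misses, but the core loop is this generate-and-check filter.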

replies(2): >>40712644 #>>40716335 #
refulgentis ◴[] No.40712644[source]
> brute forcing python solutions is a very "non human" approach.

ARC-AGI has odd features that leave me flummoxed by the naming and the attendant prize money and hype.

It is one singular task and frankly I strongly suspect someone could beat it within 30 days[1], in an unsatisfying way, as you note.

There's so much alpha that can be pieced together from here, e.g. the last couple of Google papers use the 1M context to do *500-shot*, i.e. 500 question-answer examples. IIRC the most recent showed raising the travelling-salesman problem solve rate from 3% to 35%.

[1] I pre-registered this via a Twitter post, about 48 hours ago, i.e. before this result was announced.
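The many-shot idea referenced above amounts to packing hundreds of worked examples into one long-context prompt. A rough sketch, with a made-up prompt format and toy arithmetic examples (not the format from any particular paper):

```python
# Sketch of building a many-shot prompt: pack N solved question/answer
# pairs into one long-context request before the real question.
# The example data and prompt format are illustrative only.

def build_many_shot_prompt(examples, question, n_shots=500):
    shots = []
    for q, a in examples[:n_shots]:
        shots.append(f"Q: {q}\nA: {a}")
    shots.append(f"Q: {question}\nA:")  # model completes this one
    return "\n\n".join(shots)

examples = [(f"{i} + {i}", str(2 * i)) for i in range(500)]
prompt = build_many_shot_prompt(examples, "7 + 7")
print(prompt.count("Q:"))  # 501: 500 shots plus the real question
```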

replies(2): >>40712736 #>>40714197 #
elicksaur ◴[] No.40712736[source]
The private test set has been available to crack for almost four years now. There was also a monetary prize competition run last year.

In your opinion, what has changed that would accelerate a solution to the next 30 days?

replies(1): >>40713193 #
refulgentis ◴[] No.40713193[source]
Prize money meant people would more cleverly strain the rule that "the private test set stays private, no GPT-4o, Claude, etc.", as TFA shows.

This sort of idea would then be shared openly on news sites, creating more attempts. Fallout I did not anticipate: widespread attention on general tech news sites, and then a public comment from a prize co-founder confirming it was acceptable.

replies(1): >>40713751 #
elicksaur ◴[] No.40713751[source]
It seems like you don’t understand the rules of the competition. Entries don’t have access to the internet. The OP acknowledges in their post that this is not eligible for the prize. The HN comment from the prize co-founder specifically says the OP’s claims haven’t been scrutinized. (implicit: they won’t be for the prize set unless the OP submits with an open LLM implementation)

There is a plan for a “public” leaderboard, but it currently has no entries, so we don’t actually know what the SOTA for the unrestrained version is. [1]

The general idea - test time augmentation - is what the current private set SOTA uses. [2] Generating more examples via transforming the samples is not a new idea.
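Test-time augmentation on ARC-style grids typically means expanding each training pair with symmetry transforms, on the assumption that the hidden rule is equivariant under them. A minimal sketch (the transform set here is illustrative, not the one the SOTA entry uses):

```python
# Sketch of test-time augmentation: each training pair is expanded
# with a rotation and a reflection, applied to input and output alike.

def rotate90(grid):
    # rotate clockwise: reverse rows, then transpose
    return [list(row) for row in zip(*grid[::-1])]

def flip_h(grid):
    return [row[::-1] for row in grid]

def augment(pairs):
    out = []
    for x, y in pairs:
        out.append((x, y))
        out.append((rotate90(x), rotate90(y)))
        out.append((flip_h(x), flip_h(y)))
    return out

pairs = [([[1, 0], [0, 0]], [[0, 0], [0, 1]])]
print(len(augment(pairs)))  # 3 variants per original pair
```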

Really, it seems like all the publicity has just gotten a bunch of armchair software architects coming up with 1-4 year-old ideas thinking they are geniuses.

[1] https://arcprize.org/leaderboard

[2] https://lab42.global/community-interview-jack-cole/

replies(2): >>40714174 #>>40716472 #
1. refulgentis ◴[] No.40714174[source]
> It seems like you don’t understand the rules of the competition.

I don't think you "don't understand" anything :) I'd ask you, politely, to consider that when you're replying to other people in the future.

Better to bring to interactions the prior that your interlocutor is a presumably intelligent individual who can have a different interpretation of the same facts, than decide they just don't get it. The second is a quite lonely path.

> Entries don’t have access to the internet.

Correct. Per TFA, the co-founder, Chollet, and then me: this is an offline solution; the solution is the Python program found by an LLM.

> The HN comment from the prize co-founder specifically says the OP’s claims haven’t been scrutinized.

Objection: relevancy? Is your claim here that it might be false so we shouldn't be discussing it at all?

> (implicit: they won’t be for the prize set unless the OP submits with an open LLM implementation)

I don't know what this means, "open LLM implementation" is either a term of art I don't recognize, or a misunderstanding of the situation.

I do assume you read the article, so I'm not trying to talk down to you, but to clarify:

The solution is the Python program, not the LLM prompts that iterated on a Python program. A common thread that would describe the confusing experience of reading your comment, phrased aggressively and disputing everything up until you agree with me: your observations assume I assume the solution requires a cloud-based LLM to run. As noted above, it doesn't, which is also the thrust of my comment: they found a way to skirt what I thought the rules were, and the co-founder and Chollet have embraced it, publicly.

> There is a plan for a “public” leaderboard, but it currently has no entries, so we don’t actually know what the SOTA for the unrestrained version is. [1]

This was false before you posted, when I checked this morning, and it was false as early as 4 days ago, June 14th; we can confirm via archive.is. (Prefix the URL you provided with archive.is/ to check for yourself.)

> The general idea - test time augmentation - is what the current private set SOTA uses. [2] Generating more examples via transforming the samples is not a new idea.

Did anyone claim it was?

> Really, it seems like all the publicity has just gotten a bunch of armchair software architects coming up with 1-4 year-old ideas thinking they are geniuses.

I don't know what this means other than you're upset, but yes, sounds like both you and I agree that having an LLM generate Python programs isn't quite what we'd thought would be an AGI solution in the eyes of Chollet.

Alas, here we are.

replies(2): >>40714210 #>>40720941 #
2. nl ◴[] No.40714210[source]
(Not the OP)

>> (implicit: they won’t be for the prize set unless the OP submits with an open LLM implementation)

> The solution is the Python program, not the LLM prompts that iterated on a Python program. A common thread that would describe the confusing experience of reading your comment, phrased aggressively and disputing everything up until you agree with me: your observations assume I assume the solution requires a cloud-based LLM to run. As noted above, it doesn't, which is also the thrust of my comment: they found a way to skirt what I thought the rules were, and the co-founder and Chollet have embraced it, publicly.

I think the implication is that solutions that use an LLM via an API won't be eligible (the "no internet" rule).

This seems obvious to solve: one could use GPT-4 to generate catalogs in advance and a lesser, local LLM with good code abilities to select from them.
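That split could look something like the following sketch: the catalog of program sources is produced ahead of time (a hardcoded list stands in for pre-generated GPT-4 output here), so selecting an entry at test time needs no internet access:

```python
# Sketch of the "offline catalog" idea: program sources are generated
# ahead of time, so selection at evaluation time is fully offline.
# The catalog entries below are stand-ins for pre-generated output.

CATALOG = [
    "def f(grid): return grid[::-1]",           # flip row order
    "def f(grid): return [r[::-1] for r in grid]",  # reverse each row
]

def select(train_pairs):
    for src in CATALOG:
        ns = {}
        exec(src, ns)  # compile a catalog entry into a callable
        f = ns["f"]
        if all(f(x) == y for x, y in train_pairs):
            return f
    return None

picked = select([([[1, 2]], [[2, 1]])])  # reverse-each-row rule
print(picked([[3, 4]]))  # -> [[4, 3]]
```

A local LLM could replace the exhaustive loop by ranking catalog entries, but the offline property is the same either way.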

I don't see how this skirts any rules you think were implied, and I'm puzzled why you think it does.

> sounds like both you and I agree that having an LLM generate Python programs isn't quite what we'd thought would be an AGI solution in the eyes of Chollet.

> Alas, here we are.

Chollet noted that program synthesis was a promising approach, so it's not surprising to me that a program synthesis approach that also uses an LLM is effective.

3. elicksaur ◴[] No.40720941[source]
From the leaderboard link (and on the archive version):

>ARC-AGI-Pub is a secondary leaderboard (in beta) measuring the public evaluation set. … The public evaluation set imposes no limitations on internet access or compute. At this time, ARC-AGI-Pub is not part of ARC Prize 2024 (e.g. no prizes are associated with this leaderboard).

And all the entries at the time of writing, and in the archive link, say “You?…”. “ARC-AGI 2024 HIGH SCORES”, which does have entries, is on the private test set.

>I don't think you "don't understand" anything :)

I genuinely don’t understand if we are viewing the same websites.

replies(1): >>40721090 #
4. refulgentis ◴[] No.40721090[source]
> I genuinely don’t understand if we are viewing the same websites.

We are! I missed the nuance that you're looking for a public leaderboard on the private test set. I do see it now, but I'm still confused as to how that's relevant here.