
Getting 50% (SoTA) on ARC-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points by tomduncalf | 2 comments
extr ◴[] No.40712008[source]
Very cool. When GPT-4 first came out I tried some very naive approaches using JSON representations on the puzzles [0], [1]. GPT-4 did "okay", but in some cases it felt like it was falling for the classic LLM issue of saying all the right things but then then failing to grasp some critical bit of logic and missing the solution entirely.

At the time I noticed that many of the ARC problems rely on visual-spatial priors that are "obvious" when viewing the grids, but become less so when transmuted to some other representation. Many of them rely on some kind of symmetry, counting, or the very human bias to assume a velocity or continued movement when seeing particular patterns.

I had always thought maybe multimodality was key: the model needs to have similar priors around grounded physical spaces and movement to be able to do well. I'm not sure the OP really fleshes this line of thinking out; brute forcing python solutions is a very "non human" approach.

[0] https://x.com/eatpraydiehard/status/1632671307254099968

[1] https://x.com/eatpraydiehard/status/1632683214329479169
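(A rough illustration of the kind of JSON encoding described above - a hypothetical ARC-style task laid out the way the official tasks are stored, not the actual prompts from [0]/[1]:)

    import json

    # Hypothetical ARC-style task: grids are 2-D lists of color indices 0-9,
    # with training input/output pairs plus a test input to solve.
    task = {
        "train": [
            {"input":  [[0, 1, 0],
                        [1, 1, 1],
                        [0, 1, 0]],
             "output": [[0, 2, 0],
                        [2, 2, 2],
                        [0, 2, 0]]},
        ],
        "test": [
            {"input": [[0, 3, 0],
                       [3, 3, 3],
                       [0, 3, 0]]},
        ],
    }

    prompt = (
        "Each puzzle maps an input grid to an output grid. "
        "Infer the rule from the training pairs and produce the test output.\n"
        + json.dumps(task, indent=2)
    )
    print(prompt)

Rendered as a colored grid, the "plus" shape and the recoloring rule jump out immediately; as nested lists of integers, most of that visual-spatial structure is gone, which is the point about priors above.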

replies(2): >>40712644 #>>40716335 #
refulgentis ◴[] No.40712644[source]
> brute forcing python solutions is a very "non human" approach.
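For concreteness, the quoted approach works roughly like this: sample a large number of candidate Python programs from the model, run each against the training pairs, and keep whichever ones reproduce them. A hedged sketch, not the OP's actual code; the sampler is passed in as a callable standing in for whatever LLM call produces the source of a transform(grid) function:

    def program_search(task, sample_candidate_program, n_samples=1000):
        # Brute-force program search over LLM-sampled candidates (sketch).
        survivors = []
        for _ in range(n_samples):
            src = sample_candidate_program(task)   # assumed LLM call returning Python source
            scope = {}
            try:
                exec(src, scope)                   # candidate must define transform(grid)
                transform = scope["transform"]
                if all(transform(p["input"]) == p["output"]
                       for p in task["train"]):
                    survivors.append(transform)
            except Exception:
                continue                           # discard candidates that crash
        if not survivors:
            return None
        return [survivors[0](t["input"]) for t in task["test"]]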

ARC-AGI has odd features that leave me flummoxed by the naming and the attendant prize money and hype.

It is one singular task and frankly I strongly suspect someone could beat it within 30 days[1], in an unsatisfying way, as you note.

There's so much alpha that can be pieced together from here, e.g. the last couple of Google papers use the 1M context to do *500-shot* prompting, i.e. 500 question-answer examples (rough sketch below). IIRC the most recent one showed raising the travelling-salesman solve rate from 3% to 35%.

[1] I pre-registered this via a Twitter post, about 48 hours ago, i.e. before this result was announced.
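A minimal sketch of the many-shot idea, assuming examples is simply a list of (problem, solution) string pairs; the exact formatting used in the Google papers is not reproduced here:

    def build_many_shot_prompt(examples, query, n_shots=500):
        # Pack hundreds of worked examples into one long-context prompt
        # instead of the usual handful, then append the unsolved query.
        parts = [f"Problem: {p}\nSolution: {s}\n" for p, s in examples[:n_shots]]
        parts.append(f"Problem: {query}\nSolution:")
        return "\n".join(parts)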

replies(2): >>40712736 #>>40714197 #
elicksaur ◴[] No.40712736[source]
The private test set has been available to crack for almost four years now. There was also a monetary prize competition run last year.

In your opinion, what has changed that would accelerate a solution to within the next 30 days?

replies(1): >>40713193 #
refulgentis ◴[] No.40713193[source]
Prize money meant people would more cleverly strain the rule that "the private test set stays private; no GPT-4o, Claude, etc.", as TFA shows.

This sort of idea would then be shared openly on news sites, creating more attempts. Fallout I did not anticipate was the approach getting widespread attention on general tech news sites, and then getting a public comment from a prize co-founder confirming it was acceptable.

replies(1): >>40713751 #
elicksaur ◴[] No.40713751[source]
It seems like you don’t understand the rules of the competition. Entries don’t have access to the internet. The OP acknowledges in their post that this is not eligible for the prize. The HN comment from the prize co-founder specifically says the OP’s claims haven’t been scrutinized. (Implicit: they won’t be scrutinized against the prize set unless the OP submits an implementation built on an open LLM.)

There is a plan for a “public” leaderboard, but it currently has no entries, so we don’t actually know what the SOTA for the unrestrained version is. [1]

The general idea - test-time augmentation - is what the current private-set SOTA uses. [2] Generating more examples by transforming the samples is not a new idea.
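A generic sketch of that idea for ARC-style grids - rotations, reflections, and color relabelings preserve the underlying rule, so each training pair can be multiplied into many. This is only the general concept, not the pipeline the private-set SOTA entry actually uses:

    import random

    def rotate(grid):
        # 90-degree clockwise rotation
        return [list(row) for row in zip(*grid[::-1])]

    def flip(grid):
        # horizontal mirror
        return [row[::-1] for row in grid]

    def recolor(grid, mapping):
        return [[mapping[c] for c in row] for row in grid]

    def augmented_pairs(pair):
        # Yield transformed copies of one {"input": ..., "output": ...} pair.
        colors = list(range(10))
        random.shuffle(colors)
        mapping = dict(zip(range(10), colors))
        for fn in (rotate, flip, lambda g: recolor(g, mapping)):
            yield {"input": fn(pair["input"]), "output": fn(pair["output"])}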

Really, it seems like all the publicity has just gotten a bunch of armchair software architects coming up with 1-4 year-old ideas thinking they are geniuses.

[1] https://arcprize.org/leaderboard

[2] https://lab42.global/community-interview-jack-cole/

replies(2): >>40714174 #>>40716472 #
1. YeGoblynQueenne ◴[] No.40716472[source]
I wish this comment was less confrontational because there's useful information in it and several points I agree with.
replies(1): >>40720803 #
2. elicksaur ◴[] No.40720803[source]
Hey! Genuinely, thank you for the feedback!