
Getting 50% (SoTA) on ARC-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points by tomduncalf | 4 comments
extr ◴[] No.40712008[source]
Very cool. When GPT-4 first came out I tried some very naive approaches using JSON representations on the puzzles [0], [1]. GPT-4 did "okay", but in some cases it felt like it was falling for the classic LLM issue of saying all the right things but then then failing to grasp some critical bit of logic and missing the solution entirely.

At the time I noticed that many of the ARC problems rely on visual-spatial priors that are "obvious" when viewing the grids, but become less so when transmuted to some other representation. Many of them rely on some kind of symmetry, counting, or the very human bias to assume a velocity or continued movement when seeing particular patterns.

I had always thought maybe multimodality was key: the model needs to have similar priors around grounded physical space and movement to be able to do well. I'm not sure the OP really fleshes this line of thinking out; brute-forcing Python solutions is a very "non-human" approach.

[0] https://x.com/eatpraydiehard/status/1632671307254099968

[1] https://x.com/eatpraydiehard/status/1632683214329479169

replies(2): >>40712644 #>>40716335 #
refulgentis ◴[] No.40712644[source]
> brute-forcing Python solutions is a very "non-human" approach.

ARC-AGI has odd features that leave me flummoxed by the naming, the attendant prize money, and the hype.

It is a single task, and frankly I strongly suspect someone could beat it within 30 days[1], in an unsatisfying way, as you note.

There's so much alpha that can be pieced together from here, e.g. the last couple of Google papers use the 1M context to do *500-shot* prompting, i.e. 500 question-answer examples. IIRC the most recent one showed raising the travelling-salesman problem solve rate from 3% to 35%.
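
Roughly what that many-shot setup looks like in practice (my own sketch; the papers' exact task encodings and formatting differ, and the helper names here are invented):

    # Pack hundreds of solved examples into one long-context prompt
    # before the real query.
    def format_example(question: str, answer: str) -> str:
        return f"Q: {question}\nA: {answer}\n\n"

    def build_many_shot_prompt(examples, query, shots=500):
        header = "Answer the final question in the style of the examples.\n\n"
        body = "".join(format_example(q, a) for q, a in examples[:shots])
        return header + body + f"Q: {query}\nA:"

    # examples = [("shortest tour over cities A,B,C...", "A -> C -> B..."), ...]
    # prompt = build_many_shot_prompt(examples, "shortest tour over X,Y,Z...")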

[1] I pre-registered this via a Twitter post, about 48 hours ago, i.e. before this result was announced.

replies(2): >>40712736 #>>40714197 #
nl ◴[] No.40714197[source]
I don't think this is "unsatisfying" at all.

Program synthesis has been mentioned as a promising approach by François Chollet, and that's exactly what this is.

The part I find slightly unsatisfying is this:

> Sample vast, vast numbers of completions (~5,000 per problem) from GPT-4o.

> Take the most promising 12 completions for each problem, and then try to fix each by showing GPT-4o what this program actually outputs on the examples, and then asking GPT-4o to revise the code to make it correct. We sample ~3,000 completions that attempt to fix per problem in total across these 12 starting implementations.
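
As I read it, the loop is roughly this (my sketch, not their code; run_program is a stand-in for however they execute candidates, and I'm just using the standard OpenAI chat API):

    from openai import OpenAI

    client = OpenAI()  # needs OPENAI_API_KEY set

    def sample_programs(prompt: str, n: int, batch: int = 50) -> list[str]:
        # Draw n candidate Python programs from GPT-4o, batching requests.
        out = []
        while len(out) < n:
            resp = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                n=min(batch, n - len(out)),
                temperature=1.0,
            )
            out += [c.message.content for c in resp.choices]
        return out

    def run_program(src: str, grid):
        # Hypothetical executor: assumes the candidate defines transform(grid).
        ns = {}
        try:
            exec(src, ns)
            return ns["transform"](grid)
        except Exception:
            return None

    def score(src: str, train_pairs) -> float:
        # Fraction of training pairs the candidate reproduces exactly.
        return sum(run_program(src, x) == y for x, y in train_pairs) / len(train_pairs)

    def solve(task_prompt: str, train_pairs):
        candidates = sample_programs(task_prompt, n=5000)
        seeds = sorted(candidates, key=lambda p: score(p, train_pairs), reverse=True)[:12]
        revised = []
        for prog in seeds:
            outputs = [run_program(prog, x) for x, _ in train_pairs]
            feedback = (
                f"{task_prompt}\n\nYour program:\n{prog}\n\n"
                f"Its actual outputs on the training inputs:\n{outputs}\n\n"
                "Revise the code so every example output matches."
            )
            revised += sample_programs(feedback, n=250)  # ~3,000 across 12 seeds
        return max(candidates + revised, key=lambda p: score(p, train_pairs))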

I'd been tossing around an MCTS idea similar to AlphaGo, based on the idea that the end transformation is a series of sub-transformations. I feel like this could work well alongside the GPT-4o completion catalog. (This isn't an original observation or anything.)
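
To make the sub-transformation framing concrete, here's a toy skeleton: a made-up primitive set and exhaustive search over short compositions, where a real MCTS/AlphaGo-style version would replace the brute-force loop with guided rollouts.

    import itertools

    def rotate(grid):   # rotate 90 degrees clockwise
        return [list(row) for row in zip(*grid[::-1])]

    def mirror(grid):   # flip left-right
        return [row[::-1] for row in grid]

    def recolor(grid):  # toy op: swap colors 0 and 1
        return [[1 - c if c in (0, 1) else c for c in row] for row in grid]

    PRIMITIVES = [rotate, mirror, recolor]

    def apply_sequence(grid, seq):
        for op in seq:
            grid = op(grid)
        return grid

    def search(train_pairs, max_depth=3):
        # Return the shortest composition consistent with all training pairs.
        for depth in range(1, max_depth + 1):
            for seq in itertools.product(PRIMITIVES, repeat=depth):
                if all(apply_sequence(x, seq) == y for x, y in train_pairs):
                    return seq
        return None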

replies(2): >>40714631 #>>40716375 #
1. bubblyworld ◴[] No.40714631[source]
Classic, I've been doing the same: writing an AlphaZero for the transformation part. What seems _much_ harder is picking a decent set of transformations/concepts to work with, or more generally automating that process. Maybe you're right that LLMs could help there!
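
Something like this is what I have in mind for the LLM-assisted part (purely hypothetical wiring, plain OpenAI chat API):

    from openai import OpenAI

    client = OpenAI()

    def propose_primitive(failed_task_json: str, existing_names: list[str]) -> str:
        # Ask the model for ONE new grid->grid primitive, as Python source.
        prompt = (
            "You are extending a DSL of grid transformations for ARC puzzles.\n"
            f"Existing primitives: {existing_names}\n"
            "No composition of them solves this task:\n"
            f"{failed_task_json}\n"
            "Propose one new primitive as a Python function that takes and "
            "returns a grid (list of lists of ints). Return only the code."
        )
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content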
replies(1): >>40716357 #
2. luke-stanley ◴[] No.40716357[source]
Reminds me of NVIDIA Eureka: https://github.com/eureka-research/Eureka
replies(2): >>40716671 #>>40717870 #
3. bubblyworld ◴[] No.40716671[source]
Very nice! Thanks for the link, that's great inspiration.
4. nl ◴[] No.40717870[source]
Great link, thanks.