Getting 50% (SoTA) on ARC-AGI with GPT-4o

To me the big take-aways here are:

1) Most of the heavy lifting is being done by search. We're talking about having the LLM generate thousands of candidate solutions, and they're mostly bad enough that "just pick the ones that get kinda close on the examples" is a meaningful operation.

2) More samples improve performance despite the fact that GPT-4o's vision is not capable of parsing the inputs. I'm curious how much performance would degrade if you shuffled the images passed to the model (but used the correct images when evaluating which candidates to keep).

3) It's definitely true that the LLM has to be giving you something more than random programs. At the very least, the LLM knows how to craft parsimonious programs that are more likely to be the solution. It may be that it's providing more than that, but it's not clear to me exactly how much information on the correct search space is coming from the hand-crafted examples in the prompt.
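
To make the sample-then-filter loop from the points above concrete, here is a minimal sketch. It is not the author's actual pipeline: `sample_program` is a hypothetical stand-in for the LLM call (here it just draws from a tiny fixed pool of grid transforms), and `score`, `search`, and the cell-match metric are my own assumptions for illustration. Note that the examples shown to the generator and the examples used for filtering are separate arguments, which is what you'd need for the shuffled-image experiment in (2).

```python
import random
from typing import Callable, List, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]          # (input grid, expected output grid)
Program = Callable[[Grid], Grid]


def score(program: Program, examples: List[Example]) -> float:
    """Fraction of output cells the program gets right across all examples.

    A crash or a wrong output shape counts as zero matched cells for that example.
    """
    total = 0
    matched = 0
    for inp, out in examples:
        total += sum(len(row) for row in out)
        try:
            pred = program(inp)
        except Exception:
            continue
        if len(pred) != len(out) or any(len(p) != len(o) for p, o in zip(pred, out)):
            continue
        matched += sum(pc == oc for pr, orow in zip(pred, out) for pc, oc in zip(pr, orow))
    return matched / total if total else 0.0


def search(
    sample_program: Callable[[List[Example]], Program],
    prompt_examples: List[Example],   # what the generator gets to see
    eval_examples: List[Example],     # what candidates are filtered against
    n_samples: int,
    keep: int,
) -> List[Program]:
    """Sample many candidates, then keep the ones closest on the examples."""
    candidates = [sample_program(prompt_examples) for _ in range(n_samples)]
    ranked = sorted(candidates, key=lambda p: score(p, eval_examples), reverse=True)
    return ranked[:keep]


# Hypothetical toy "LLM": ignores the prompt and picks a transform at random.
_rng = random.Random(0)
_POOL: List[Program] = [
    lambda g: [row[:] for row in g],            # identity
    lambda g: [row[::-1] for row in g],         # flip left-right
    lambda g: g[::-1],                          # flip top-bottom
    lambda g: [list(col) for col in zip(*g)],   # transpose
]


def toy_sampler(prompt_examples: List[Example]) -> Program:
    return _rng.choice(_POOL)
```

With a real model in place of `toy_sampler`, passing shuffled grids as `prompt_examples` while keeping the correct ones as `eval_examples` would separate how much the generator contributes from how much the filter does.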

Overall, the work to get this far is very impressive, but it doesn't really move the needle for me on whether GPT-4o can do ARC puzzles. It does, however, show me that search is surprisingly powerful on this task.