Getting 50% (SoTA) on Arc-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points by tomduncalf | 1 comment
asperous No.40712326
Having tons of people employ human ingenuity to manipulate existing LLMs into passing this one benchmark kind of defeats the purpose of testing for "AGI". The author points this out as it's more of a pattern matching test.

Though on the other hand, figuring out which manipulations are effective does teach us something. And since most problems boil down to pattern matching, creating a true, easily testable AGI benchmark may be tough.

replies(5): >>40712503 #>>40712555 #>>40712632 #>>40713120 #>>40713156 #
opdahl No.40712555
Wouldn’t the real AGI test be for an AI to do what the author did here and write this blog post itself?
replies(2): >>40712730 #>>40716667 #
killerstorm No.40716667
I wouldn't be surprised if GPT-5 could do it: it knows that it's an LLM, so it knows its limitations. It can write code to pre-process the input into a format it understands better, etc.

https://chatgpt.com/share/2fde1db5-00cf-404d-9ae5-192aa5ac90...

GPT-4 created a plan very similar to the article's: it also suggested using Python to pre-process the data, and it also suggested program synthesis. So I'd say it's already 90% there.
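A minimal sketch of the pre-processing step both the article and GPT-4's plan point at (assuming ARC's JSON list-of-rows grid encoding; the function name is mine):

```python
# Render an ARC-style grid (a list of rows of color indices) as aligned
# text, which LLMs tend to parse more reliably than raw nested JSON.
def grid_to_text(grid):
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

example = [[0, 0, 1],
           [0, 1, 0],
           [1, 0, 0]]
print(grid_to_text(example))
```

This prints one row per line, so row/column structure is visible in the token stream instead of being buried in brackets and commas.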

> "Execute the synthesized program on the test inputs."

> "Verify the outputs against the expected results. If the results are incorrect, iteratively refine the hypotheses and rules."

So people saying that it's ad hoc are wrong. LLMs know how to solve these tasks; they are just not very good at coding, and iterative-refinement tooling is in its infancy.
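The quoted execute-verify-refine loop can be sketched as follows. Here `propose` and `refine` stand in for LLM calls (hypothetical helpers, not part of any real API), and a candidate "program" is simply a Python callable on a grid:

```python
def solves(program, train_pairs):
    # A candidate solves the task if it maps every training input
    # to its expected output.
    return all(program(inp) == out for inp, out in train_pairs)

def refine_loop(propose, refine, train_pairs, max_iters=5):
    # Propose a program, execute it on the training pairs, and
    # iteratively refine it until the outputs verify (or we give up).
    candidate = propose()
    for _ in range(max_iters):
        if solves(candidate, train_pairs):
            return candidate
        candidate = refine(candidate, train_pairs)  # hypothetical LLM refinement step
    return None

# Toy usage: the "task" is transposing the grid; the first proposal is
# the identity, and the refinement step swaps in the correct transform.
pairs = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]
first = lambda g: g
fixed = lambda g: [list(r) for r in zip(*g)]
found = refine_loop(lambda: first, lambda c, p: fixed, pairs)
print(found is fixed)  # → True
```

The loop itself is trivial; the hard parts the comment alludes to are the quality of the proposed code and how much signal the failed training pairs give the refinement step.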