Though on the other hand, figuring out which manipulations are effective does teach us something. And since I think most problems boil down to pattern matching, creating a true, easily testable AGI test may be tough.
Hard to find the right float, but worth trying, I think.
This would sound more far-fetched if we knew exactly how they work, bit by bit. But we've been training them statistically, via the data-for-code tradeoff, and the question is not yet satisfactorily answered.
In this hypothetical, for every accusation that an LLM passes a test only because it's been coached to do so, there's a counter that the test was designed around an "excessively human" notion of AGI to begin with, maybe even designed, unconsciously, so that humans would pass it preferentially. The attorney for the hypothetical AGI inside the LLM would argue that there are tons of "LLM AGI" problems it can solve that a human would struggle with.
Fundamentally, the tests are only useful insofar as they let us improve AI. When evaluating novel approaches to passing them, like this one, we should err in the approach's favor, IMO. A 'gotcha' test is the least useful kind.
The human brain is millions of years of brute-force evolution in the making. Comparing it to a transformer, or really to any other ANN, which essentially starts from scratch, doesn't mean much.
Anyway, my point was that humans direct their energy better than randomly spamming ideas, at least since the innovation of the scientific method. An LLM, by contrast, struggles deeply to perform reasoning.
This isn't really true. If you give an LLM a large prompt detailing a new spoken language, programming language, or logical framework, along with a couple of examples, and ask it to do something with it, it will probably do a lot better than an average human who reads the same prompt and attempts the same task.
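For instance, something like the sketch below. This is a toy illustration: the mini-language "Blip", the task, and the model choice are all invented here, and the call is shaped like the OpenAI Python client.

    # Toy in-context-learning test. "Blip" is a made-up stack language;
    # the model should learn it purely from the rules and examples below.
    from openai import OpenAI

    PROMPT = """You will learn a tiny stack language called Blip.
    Rules:
      - "p N" pushes the number N onto the stack.
      - "a" pops two numbers and pushes their sum.
      - "d" pops one number and pushes it back twice.
    Examples:
      "p 2 p 3 a"  -> final stack is [5]
      "p 4 d a"    -> final stack is [8]
    Task: what is the final stack after "p 1 d a p 5 a"?"""

    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4o",  # arbitrary choice for this sketch
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(reply.choices[0].message.content)  # correct answer: [7]

Most current LLMs handle this kind of thing fine, while an average human given the same prompt would need noticeably more time to get to [7].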
https://chatgpt.com/share/2fde1db5-00cf-404d-9ae5-192aa5ac90...
GPT-4 created a plan very similar to the article's: it also suggested using Python to pre-process the data, and it also suggested program synthesis. So I'd say it's already 90% of the way there.
> "Execute the synthesized program on the test inputs."
> "Verify the outputs against the expected results. If the results are incorrect, iteratively refine the hypotheses and rules."
So the people saying this approach is ad hoc are wrong. LLMs know how to solve these tasks; they are just not very good at coding, and iterative-refinement tooling is in its infancy.
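To make that concrete, here's a minimal sketch of the execute/verify/refine loop from those quoted steps. Everything here is hypothetical glue, not the article's code: `ask_llm` stands in for any chat-completion call, and tasks are given as (input, expected-output) string pairs.

    # Sketch: synthesize a program, run it on the training pairs, and
    # feed failures back to the model until everything passes or the
    # iteration budget runs out.
    import subprocess
    import sys
    import tempfile

    def ask_llm(prompt: str) -> str:
        """Stand-in for a real LLM call; should return Python source."""
        raise NotImplementedError

    def run_program(source: str, task_input: str) -> str:
        """Run a candidate program in a subprocess, input on stdin."""
        with tempfile.NamedTemporaryFile("w", suffix=".py",
                                         delete=False) as f:
            f.write(source)
            path = f.name
        result = subprocess.run([sys.executable, path], input=task_input,
                                capture_output=True, text=True, timeout=10)
        return result.stdout.strip()

    def solve(task_examples, max_iters=5):
        prompt = ("Write a Python program (stdin -> stdout) that solves "
                  f"these input/output pairs:\n{task_examples}")
        source = ask_llm(prompt)
        for _ in range(max_iters):
            # Verify the candidate against every training pair.
            failures = []
            for inp, expected in task_examples:
                got = run_program(source, inp)
                if got != expected:
                    failures.append((inp, expected, got))
            if not failures:
                return source  # all training pairs pass
            # Refine: show the model its own program and where it failed.
            prompt = (f"This program:\n{source}\nfailed on these "
                      f"(input, expected, got) cases:\n{failures}\nFix it.")
            source = ask_llm(prompt)
        return None  # no candidate passed within the budget

The loop itself is trivial; the hard parts in practice are sandboxing the execution and writing prompts so the model's fixes don't regress the cases it already passed.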