
Getting 50% (SoTA) on ARC-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points by tomduncalf | 2 comments
asperous No.40712326
Having tons of people employ human ingenuity to manipulate existing LLMs into passing this one benchmark kind of defeats the purpose of testing for "AGI". The author points this out, since it's more of a pattern-matching test.

Though on the other hand, figuring out which manipulations are effective does teach us something. And I think most problems boil down to pattern matching, so creating a true, easily testable AGI test may be tough.

replies(5): >>40712503 #>>40712555 #>>40712632 #>>40713120 #>>40713156 #
opdahl No.40712555
Wouldn't the real AGI test be whether an AI could do what the author did here and write this blog post?
replies(2): >>40712730 #>>40716667 #
atroche No.40712730
Yep, but a float is more useful than a bool for tracking progress, especially if you want to answer questions like "how soon can we expect (drivers/customer support staff/programmers) to lose their jobs?"

Hard to find the right float but worth trying I think.

replies(1): >>40713241 #
opdahl No.40713241
I agree, but it does seem a bit strange that you are allowed to "custom-fit" an AI program to solve a specific benchmark. Shouldn't there be some sort of rule that, for something to count as AGI, it should work as "off-the-shelf" as possible?
replies(1): >>40713415 #
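
Concretely, the "custom-fit" program under discussion is a scaffold along roughly these lines: ask the model for many candidate Python programs per puzzle, execute them against the training pairs, and submit the output of a program that reproduces them. The sketch below is illustrative only; ask_llm_for_programs is a hypothetical stand-in for the actual GPT-4o calls, and ARC grids are assumed to be lists of lists of ints.

    def solve_task(train_pairs, test_input, ask_llm_for_programs, n_candidates=128):
        """Generate candidate transform programs and keep one that fits the examples."""
        candidates = ask_llm_for_programs(train_pairs, n=n_candidates)  # Python source strings
        for src in candidates:
            scope = {}
            try:
                exec(src, scope)                  # each candidate should define transform(grid)
                transform = scope["transform"]
                if all(transform(inp) == out for inp, out in train_pairs):
                    return transform(test_input)  # first program matching all examples wins
            except Exception:
                continue                          # malformed or crashing candidates are discarded
        return None                               # nothing reproduced the training examples
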
soist No.40713415
If OpenAI had an embedded Python interpreter, or for that matter an interpreter for lambda calculus or some other Turing-complete formalism, then this approach would work, but there are no LLMs with embedded symbolic interpreters. LLMs currently are essentially probability distributions based on a training corpus and do not have any symbolic reasoning capabilities. There is no backtracking, for example, like in Prolog.
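
For readers unfamiliar with the Prolog reference, here is a minimal Python sketch (my illustration, not anything from the linked post) of the backtracking style being described: the solver commits to a choice, recurses, and explicitly undoes the choice when a constraint fails, whereas left-to-right token sampling has no comparable undo step. The map-coloring problem and all names are illustrative.

    def color_map(regions, adjacent, colors):
        """Assign colors so no two adjacent regions share one, via depth-first backtracking."""
        assignment = {}

        def consistent(region, color):
            return all(assignment.get(nbr) != color for nbr in adjacent[region])

        def solve(i):
            if i == len(regions):
                return True
            region = regions[i]
            for color in colors:
                if consistent(region, color):
                    assignment[region] = color   # choose
                    if solve(i + 1):             # descend
                        return True
                    del assignment[region]       # backtrack: undo the choice
            return False                         # no color fits; fail upward

        return assignment if solve(0) else None

    print(color_map(
        ["WA", "NT", "SA", "Q", "NSW", "V"],
        {"WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
         "SA": ["WA", "NT", "Q", "NSW", "V"], "Q": ["NT", "SA", "NSW"],
         "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"]},
        ["red", "green", "blue"],
    ))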