Though on the other hand, figuring out which manipulations are effective does teach us something. And since I think most problems boil down to pattern matching, creating a true, easily testable AGI test may be tough.
Hard to find the right float, but worth trying, I think.
This would sound more far-fetched if we knew exactly how they work, bit by bit. But we've been training them statistically, via the data-for-code tradeoff, and the question is not yet satisfactorily answered.
In this hypothetical, for every accusation that an LLM passes a test only because it's been coached to do so, there's a counter that the test was designed around an "excessively human" notion of AGI to begin with, maybe even designed, unconsciously, so that humans would pass it preferentially. The attorney for the hypothetical AGI inside the LLM would argue that there are tons of "LLM AGI" problems it can solve that a human would struggle with.
Fundamentally, the tests are only useful insofar as they let us improve AI. When evaluating novel approaches to passing them, like this one, we should err in the approach's favor, IMO. A 'gotcha' test is the least useful kind.
The human brain is millions of years of brute-force evolution in the making. Comparing it to a transformer, or really to any other ANN, which essentially starts from scratch, doesn't mean much.
Anyway, my point was that humans direct their energy better than randomly spamming ideas, at least since the innovation of the scientific method. An LLM, by contrast, struggles deeply to perform reasoning.
This isn't really true. If you give an LLM a large prompt detailing a new spoken language, programming language, or logical framework, along with a couple of examples, and ask it to do something with it, it will probably do a lot better than an average human who reads the same prompt and attempts the same task.
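For instance, something like the sketch below. This is a toy illustration: the mini-language "Blip", the task, and the model choice are all invented here, and the call is shaped like the OpenAI Python client.

    # Toy in-context-learning test. "Blip" is a made-up stack language;
    # the model should learn it purely from the rules and examples below.
    from openai import OpenAI

    PROMPT = """You will learn a tiny stack language called Blip.
    Rules:
      - "p N" pushes the number N onto the stack.
      - "a" pops two numbers and pushes their sum.
      - "d" pops one number and pushes it back twice.
    Examples:
      "p 2 p 3 a"  -> final stack is [5]
      "p 4 d a"    -> final stack is [8]
    Task: what is the final stack after "p 1 d a p 5 a"?"""

    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4o",  # arbitrary choice for this sketch
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(reply.choices[0].message.content)  # correct answer: [7]

Most current LLMs handle this kind of thing fine, while an average human given the same prompt would need noticeably more time to get to [7].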
https://chatgpt.com/share/2fde1db5-00cf-404d-9ae5-192aa5ac90...
GPT-4 created a plan very similar to the article's: it also suggested using Python to pre-process the data, and it also suggested program synthesis. So I'd say it's already 90% of the way there.
> "Execute the synthesized program on the test inputs."
> "Verify the outputs against the expected results. If the results are incorrect, iteratively refine the hypotheses and rules."
So the people saying this approach is ad hoc are wrong. LLMs know how to solve these tasks; they are just not very good at coding, and iterative-refinement tooling is in its infancy.
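To make that concrete, here's a minimal sketch of the execute/verify/refine loop from those quoted steps. Everything here is hypothetical glue, not the article's code: `ask_llm` stands in for any chat-completion call, and tasks are given as (input, expected-output) string pairs.

    # Sketch: synthesize a program, run it on the training pairs, and
    # feed failures back to the model until everything passes or the
    # iteration budget runs out.
    import subprocess
    import sys
    import tempfile

    def ask_llm(prompt: str) -> str:
        """Stand-in for a real LLM call; should return Python source."""
        raise NotImplementedError

    def run_program(source: str, task_input: str) -> str:
        """Run a candidate program in a subprocess, input on stdin."""
        with tempfile.NamedTemporaryFile("w", suffix=".py",
                                         delete=False) as f:
            f.write(source)
            path = f.name
        result = subprocess.run([sys.executable, path], input=task_input,
                                capture_output=True, text=True, timeout=10)
        return result.stdout.strip()

    def solve(task_examples, max_iters=5):
        prompt = ("Write a Python program (stdin -> stdout) that solves "
                  f"these input/output pairs:\n{task_examples}")
        source = ask_llm(prompt)
        for _ in range(max_iters):
            # Verify the candidate against every training pair.
            failures = []
            for inp, expected in task_examples:
                got = run_program(source, inp)
                if got != expected:
                    failures.append((inp, expected, got))
            if not failures:
                return source  # all training pairs pass
            # Refine: show the model its own program and where it failed.
            prompt = (f"This program:\n{source}\nfailed on these "
                      f"(input, expected, got) cases:\n{failures}\nFix it.")
            source = ask_llm(prompt)
        return None  # no candidate passed within the budget

The loop itself is trivial; the hard parts in practice are sandboxing the execution and writing prompts so the model's fixes don't regress the cases it already passed.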