Getting 50% (SoTA) on ARC-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points by tomduncalf | 32 comments
1. eigenvalue ◴[] No.40712174[source]
The ARC stuff just felt intuitively wrong as soon as I heard it. I don't find any of Chollet's critiques of LLMs to be convincing. It's almost as if he's being overly negative about them to make a point or something to push back against all the unbridled optimism. The problem is, the optimism really seems to be justified, and the rate of improvement of LLMs in the past 12 months has been nothing short of astonishing.

So it's not at all surprising to me to see ARC already being mostly solved using existing models, just with different prompting techniques and some tool usage. At some point, the naysayers about LLMs are going to have to confront the problem that, if they are right about LLMs not really thinking/understanding/being sentient, then a very large percentage of people living today are also not thinking/understanding/sentient!

replies(11): >>40712233 #>>40712290 #>>40712304 #>>40712352 #>>40712385 #>>40712431 #>>40712465 #>>40712713 #>>40713110 #>>40713491 #>>40714220 #
2. HarHarVeryFunny ◴[] No.40712233[source]
Actually the solution being discussed here is the one that Chollet mentioned in his interview with Dwarkesh, and only bolsters his case.

The LLM isn't doing the reasoning here; it's just pattern-matching the before/after diff and generating thousands of Python programs. The actual reasoning is done by an agent-like loop wrapped around the LLM, as described in the linked blog.
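
To make the division of labor concrete, here is a minimal sketch of that outer loop, assuming the usual ARC task JSON layout ("train"/"test" lists of "input"/"output" grids); the sampler and names are illustrative, not the blog's actual code:

    def run_candidate(program_src, grid):
        # Each sampled candidate is expected to define transform(grid) -> grid.
        namespace = {}
        exec(program_src, namespace)
        return namespace["transform"](grid)

    def solve(task, sample_programs):
        # sample_programs(task) stands in for the LLM call that proposes
        # thousands of candidate Python source strings.
        for src in sample_programs(task):
            try:
                # Keep a candidate only if it reproduces every demo pair exactly.
                if all(run_candidate(src, d["input"]) == d["output"]
                       for d in task["train"]):
                    return [run_candidate(src, t["input"]) for t in task["test"]]
            except Exception:
                continue  # crashing or malformed candidates are simply discarded
        return None

The selection over many sampled programs, not any single completion, is what ends up looking like reasoning.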

replies(1): >>40713162 #
3. Smaug123 ◴[] No.40712290[source]
> a very large percentage of people living today are also not thinking/understanding/sentient

This isn't that big a bullet to bite (https://www.lesswrong.com/posts/4AHXDwcGab5PhKhHT/humans-who... comes from well before ChatGPT's launch), and I myself am inclined to bite it. System 1 alone does not a general intelligence make, although the article is extremely interesting in asking the question "is System 1 plus Python enough for a general intelligence?". But it's not a very relevant philosophical point, because Chollet's position is consistent with humans being obsoleted and/or driven extinct whether or not the LLMs are "general intelligences".

His position is that training LLMs results in an ever-larger number of learned algorithms and no ability to construct new algorithms. This is consistent with the possibility that, after some threshold of size and training, the LLM has learned every algorithm it needs to supplant humans in (say) 99.9% of cases. (It would definitely be going out with a whimper rather than a bang, on that hypothesis, to be out-competed by something that _really is_ just a gigantic lookup table!)

4. threeseed ◴[] No.40712304[source]
a) A 50% result is not solving the problem, especially when the implementation is brute-forcing it, which is against the spirit of ARC.

b) He is not being overly negative about LLMs. In fact he believes they will play a role in any AGI system.

c) The OpenAI CTO has publicly said that GPT-5 will not be significantly better than existing models. So the rate of improvement you believe in simply doesn't match reality.

replies(3): >>40712459 #>>40714041 #>>40714049 #
5. traject_ ◴[] No.40712352[source]
> It's almost as if he's being overly negative about them to make a point or something to push back against all the unbridled optimism.

I don't think it's like that; rather, Chollet wants to see stronger neuroplasticity in these models. I think there is a divide between the effectiveness of existing AI models and their ability to be autonomous, robust, and able to consistently learn from unanticipated problems.

My guess is Chollet wants to see something closer to biological organisms, especially mammals or birds, in their degree of autonomy. I think people underestimate how many novel problems birds and mammals face in simply navigating their environment, and it is by that comparison that LLMs, for now at least, seem lacking.

So when he says LLMs are not sentient, he's asking us to consider the novel problems animals, let alone humans, have to face in navigating their environment. This is especially apparent in young children but declines as we age, gain experience, and lose the sense of novelty.

replies(1): >>40713640 #
6. adroniser ◴[] No.40712385[source]
I don't see how the point about the typical human is relevant. Either you can reason or you can't; the ARC test is supposed to be an objective way to measure this. Clearly a vanilla LLM currently cannot do this, and somehow an expert crafting a super-specific prompt is supposed to be impressive.
replies(1): >>40712781 #
7. TacticalCoder ◴[] No.40712431[source]
> I don't find any of Chollet's critiques of LLMs to be convincing. It's almost as if he's being overly negative about them to make a point or something to push back against all the unbridled optimism.

Chollet published his paper "On the Measure of Intelligence" in 2019. In Internet time that is a lifetime before the LLM hype started.

replies(2): >>40712651 #>>40712888 #
8. janalsncm ◴[] No.40712459[source]
For the record, a lot of problems might turn out to be like this, where we figure out a brute-force approach that stands in for human creativity.
9. imtringued ◴[] No.40712465[source]
Yeah I agree. We have reached the end of LLMs. LLMs are infallible and require no further improvement. Anyone who points out shortcomings of current architectures and training approaches should be ignored as a naysayer. Anyone who proposes a solution to perceived flaws is a crank trying to fix something that was never broken. Everyone knows humans are incapable of internal monologues or visualization and vocalisation. Humans don't actually move their lips to speak to produce a sound that can be interpreted by a speaker of the same language, they produce universally understood tokens encoding objective reality and the fact that they use the local language is merely a habit that is hard to break out of.
replies(1): >>40720903 #
10. refulgentis ◴[] No.40712651[source]
Einstein, infamously, couldn't really make much progress with quantum physics, even though he invented the precursors (e.g. Brownian motion). Your world model is hard to update.
replies(2): >>40713512 #>>40713624 #
11. biophysboy ◴[] No.40712713[source]
I don't think he's as critical as you say. He just views LLMs as the product of intelligence rather than intelligence itself. LLM fans will say this is a false distinction, I guess.

His definition of intelligence is interesting: something that can quickly achieve tasks with few priors or experience. I also think the idea of using human "Core Knowledge" priors is a clever way to make a test.

12. eigenvalue ◴[] No.40712781[source]
The point is that if you have some test of whether an AI is intelligent that the vast majority of living humans would fail, or do worse on than GPT-4o (let alone future LLMs), then it's not a very persuasive argument.
13. gwern ◴[] No.40712888[source]
From Chollet's perspective, the LLM hype started well before, with at least GPT-2 half a year before his paper, and he spent plenty of time mocking GPT-2 on Twitter before he came up with ARC as a rebuttal.
replies(1): >>40713697 #
14. lassoiat ◴[] No.40713110[source]
I am a ChatGPT fanboy and have been quite impressed by 4o, but I will really be impressed when it stops inventing aspects of Python libraries that don't exist and instead just tells me they don't exist.

It literally just did this for me 15 minutes ago. You can't talk about AGI when it is this easy to push it over the edge into something it doesn't know.

Paper references have gotten better over the last 12 months, but just this week it made up both a book and a paper for me that do not exist. The authors exist, and they did not write what it said they did.

It is very interesting: if you ask "do you understand your responses?", sometimes it will say yes and sometimes it will say no, not like a human understands.

We should forget about AGI until it can at least say it doesn't know something. It is hardly a sign of intelligence in humans to make up answers to questions you don't know.

replies(1): >>40713818 #
15. awwaiid ◴[] No.40713162[source]
When you peer into the soul of the machine it delicately resolves to `while(1){...}`. All Hail The REPL.
16. imperfect_light ◴[] No.40713491[source]
Did you listen to what Chollet said? How much of the improvement in LLMs is due to enlarging the training sets to cover more problems, and how much is due to any emergent properties?
17. imperfect_light ◴[] No.40713512{3}[source]
A bit of a stretch, given that Chollet is a researcher in deep learning and transformers, and his criticism is that memorization (training LLMs on lots and lots of problems) doesn't equate to AGI.
replies(1): >>40713693 #
18. infgeoax ◴[] No.40713624{3}[source]
But it was his EPR paper that inspired Bell's inequality and pushed the field further. Yes, he was wrong about how reality works, but he still asked the right question.
19. infgeoax ◴[] No.40713640[source]
Agree. When I first saw ARC, my reaction was that this could be the kind of problem that gives us evolutionary pressure.
20. refulgentis ◴[] No.40713693{4}[source]
> A bit of a stretch

Is that true?

Cf. what we're discussing:

He's actively encouraging using LLMs to solve his benchmark, ARC-AGI.

8 hours ago, from Chollet, re: TFA

"The best solution to fight combinatorial explosion is to leverage intuition over the structure of program space, provided by a deep learning model. For instance, you can use a LLM to sample a program..."

Source: https://x.com/fchollet/status/1802801425514410275

replies(1): >>40713793 #
21. modeless ◴[] No.40713697{3}[source]
It's a very convincing rebuttal considering that GPT-3 and GPT-4 came out after ARC but made no significant progress on it. He seemingly had the single most accurate and verifiable prediction of anyone in the world (in 2019) about exactly what type of tasks scaled LLMs would be bad at.
replies(1): >>40729440 #
22. imperfect_light ◴[] No.40713793{5}[source]
The stretch was in reference to comparing Chollet to Einstein. Chollet clearly understands LLMs (and transformers and deep learning), he simply doesn't believe they are sufficient for AGI.
replies(1): >>40714132 #
23. motoxpro ◴[] No.40713818[source]
Every time you're wrong and you disagree with someone who is right, you are inventing things that don't exist.

Unless you're saying you have never held on to an opinion that was at some point proven to be wrong?

24. ◴[] No.40714041[source]
25. hackerlight ◴[] No.40714049[source]
Skeptical about (c), source please. She did say they don't have anything much better than GPT-4o currently, but GPT-5 likely only started training recently.
26. refulgentis ◴[] No.40714132{6}[source]
I don't know what you mean; it's a straightforward analogy. But yes, that's right, except for the part where he's heralding this news by telling people that LLMs are an underexplored space of possible solutions to the AGI benchmark he made to disprove that LLMs are AGI.

I don't mean to offend, but to be really straightforward: he's the one saying it's possible they might be AGI now. I'm as flummoxed as you, but I think it's hiding the ball to file it under "he doesn't mean what he's saying, because he doesn't believe LLMs can ever be AGI." The only steelman for that is playing at: the AGI of my benchmark, which I say is for AGI, is not the AGI I mean.

replies(1): >>40714424 #
27. Lockal ◴[] No.40714220[source]
It's a big jump in generalization to claim that brute-forcing 4 colors on 9x9 grids with 8,000 programs comes anywhere near what a sentient human can do.

Back in the day, similar generalization was claimed for the Deep Blue chess computer. The computer won in 1997, but the AGI abyss is still just as big.

28. imperfect_light ◴[] No.40714424{7}[source]
You're reading a whole lot into a tweet. In his interview with Dwarkesh Patel he says, about 20 different times, that scaling LLMs (as they are currently conceived) won't lead to AGI.
replies(1): >>40714964 #
29. anoncareer0212 ◴[] No.40714964{8}[source]
You keep changing topics, so I don't get it either. I can attest it's not a fringe view that the situation is interesting; I've seen it discussed several times today by unrelated people.
replies(1): >>40720617 #
30. imperfect_light ◴[] No.40720617{9}[source]
He's said it pretty clearly: an LLM could be part of the solution in combination with program synthesis, but an LLM alone won't achieve AGI.
31. mrtranscendence ◴[] No.40720903[source]
Sometimes, when I'm undertaking the arduous work of assigning probabilities to everything I could possibly say next in a conversation, I wish that I weren't merely a stochastic autoregressive next-token generator. Them's the breaks, though.
32. gwern ◴[] No.40729440{4}[source]
Well, that's true inasmuch as every other prediction did far worse. Saying ARC did the best is passing a low bar when your competition is people like Gary Marcus or HNers saying 'yeah but no scaled-up GPT-3 could ever write a whole program'...

But since ARC was from the start clearly a vision task - most of these transforms or rules make no sense without a visual geometric prior - it wasn't that convincing, and we see plenty of progress with LLMs.