Getting 50% (SoTA) on Arc-AGI with GPT-4o

(redwoodresearch.substack.com)

394 points tomduncalf | 2 comments | 17 Jun 24 21:51 UTC | HN request time: 0.398s | source

Show context

eigenvalue ◴[17 Jun 24 23:00 UTC] No.40712174[source]▶

The Arc stuff just felt intuitively wrong as soon as I heard it. I don't find any of Chollet's critiques of LLMs to be convincing. It's almost as if he's being overly negative about them to make a point or something to push back against all the unbridled optimism. The problem is, the optimism really seems to be justified, and the rate of improvement of LLMs in the past 12 months has been nothing short of astonishing.

So it's not at all surprising to me to see Arc already being mostly solved using existing models, just with different prompting techniques and some tool usage. At some point, the naysayers about LLMs are going to have to confront the problem that, if they are right about LLMs not really thinking/understanding/being sentient, then a very large percentage of people living today are also not thinking/understanding/sentient!

replies(11): >>40712233 #>>40712290 #>>40712304 #>>40712352 #>>40712385 #>>40712431 #>>40712465 #>>40712713 #>>40713110 #>>40713491 #>>40714220 #

1. adroniser ◴[17 Jun 24 23:25 UTC] No.40712385[source]▶

>>40712174 #

I don't see how the point about the typical human is relevant. Either you can reason or you can't, the ARC test is supposed to be an objective way to measure this. Clearly a vanilla LLM currently cannot do this, and somehow an expert crafting a super-specific prompt is supposed to be impressive.

replies(1): >>40712781 #

2. eigenvalue ◴[18 Jun 24 00:26 UTC] No.40712781[source]▶

>>40712385 (TP) #

The point is that if you have some test of whether an AI is intelligent that the vast majority of living humans would fail or do worse on than gpt4-o (let alone future LLMs) then it’s not a very persuasive argument.

↑