> the Park et al. “Generative Agents” paper is using a different architecture from GPT
It's using a wrapper around GPT-3.5.
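For concreteness, here's a minimal sketch of what "a wrapper" means here. This is not the paper's actual code, just the shape of it: the agent keeps a memory stream, retrieves some memories, and asks the underlying model to plan. `call_llm` is a hypothetical stand-in for whatever chat-completion client you use.

```python
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Stand-in for a GPT-3.5 chat-completion call; replace with a real client."""
    return f"[model response to: {prompt[:40]}...]"


@dataclass
class Agent:
    name: str
    memories: list[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        # Append observations to the agent's memory stream.
        self.memories.append(event)

    def plan(self, goal: str, k: int = 5) -> str:
        # Naive retrieval: just take the k most recent memories. The paper's
        # wrapper scores memories by recency, importance, and relevance instead.
        context = "\n".join(self.memories[-k:])
        prompt = (
            f"You are {self.name}. Recent memories:\n{context}\n\n"
            f"Make a short plan to achieve: {goal}"
        )
        return call_llm(prompt)


if __name__ == "__main__":
    agent = Agent("Klaus")
    agent.observe("Woke up at 7am.")
    agent.observe("Remembered the essay is due Friday.")
    print(agent.plan("finish the essay draft"))
```

The point is that everything around the model is ordinary bookkeeping; the planning itself is still the LLM completing text.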
> and specifically says that an LLM by itself is not capable of making believable plans (!)
I don't see where it says that, and at any rate I suspect it's false. Be careful not to equivocate between "We couldn't get it to make a plan" and "it cannot make plans". Many people have decided that LLMs are incapable of a thing on the basis of very bad prompts.
> and they warn “We suggest that generative agents should never be a substitute for real human input in studies and design processes.”
I suspect they're referring to generative agents at the current level of skill. I don't think they mean it as "ever, under any circumstances, no matter how capable".
> We have some idea about how to make AI permanently learn to autocomplete about things that weren’t in the training data, but GPT doesn’t have that yet, nor does any other AI to date.
I don't even think humans have that. We simply have a very abstracted library of patterns. "Things that aren't in the training data" don't look like an unusual circumstance; they look like random noise. So long as we can phrase a situation in understandable terms, it's by definition not outside the training data.
> Are you essentially arguing that you think humans are autocomplete and nothing more?
I view it the other way around: I think "autocomplete" is such a generic term that it can fit anything we do. It's like saying humans are "just computers" - like, yes, I think the function our brains evaluate is computable, but that doesn't actually put any restrictions on what it can be. Any worldmodel can be called "autocomplete".
> But the theme I’m seeing in the pro-AI arguments is a talking point that since we don’t fully understand human consciousness and can’t define it in such a way that excludes today’s AI, then GPT is probably AGI already.
To be clear, I think GPT is AGI for other reasons; the arguments about consciousness simply fail to justify excluding it. I think GPT is AGI because when I try to track the development of AI, I evaluate something like: which capabilities do I, a human, have? Which capabilities do I know GPT can simulate? What's left to make them match up? GPT will naturally develop analogues to these capabilities simply as a matter of backprop over a human data corpus; if it fails, it will be due to insufficient scale, inadequate design, inadequate training, etc. So then I look at how it fails. Where is the step where I make a decision, where I introspectively go zig, and GPT goes zag? And in my model, none of the remaining weaknesses and inabilities are things that the transformer architecture cannot represent. My view is that if you got God to do it, He could probably turn GPT 3.5 into an AGI by precisely selecting the right weights and then writing a very thin wrapper. I think the fact that we cannot find those weights is much more down to the training corpus than to anything architectural. When I watch GPT 3.5 reason through a problem, I recognize my internal narrative; conversely, when I make an amusing blooper IRL, I occasionally recognize where my brain autocompleted a pattern a bit too readily.
Of course, the oft-repeated pattern of "GPT will never be able to X", followed the next day by "We present a prompt that gets GPT to do X", also does nothing to dissuade me.
Like, "GPT can't think, it can only follow prompts". Prompts are a few lines of text. GPT is a text generator. Do you really think prompts are going to be, in the long term, the irreducibly human technology that keeps GPT from parity with us? If the only thing keeping GPT from being AGI is the ability to write prompts, then we're one good prompt-generation dataset away from AGI.
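To make that last point concrete, here's a toy illustration, under the assumption that "writing a prompt" is itself just a text-completion task. `call_llm` is again a hypothetical stand-in for whatever completion API you use; nothing here is a real library call.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a GPT-style completion call."""
    return f"[model output for: {prompt[:40]}...]"


def self_prompting_loop(task: str, steps: int = 3) -> str:
    # The model writes a prompt, answers it, then revises its own prompt:
    # prompts are text, and the model generates text.
    prompt = f"Write the best possible prompt for solving this task: {task}"
    result = ""
    for _ in range(steps):
        generated_prompt = call_llm(prompt)
        result = call_llm(generated_prompt)
        prompt = (
            f"Task: {task}\nPrevious attempt: {result}\n"
            "Write an improved prompt that would get a better answer."
        )
    return result


print(self_prompting_loop("summarize the Generative Agents paper"))
```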