But before starting your sentence, you internally formulate the gist of the sentence you're going to say.
Which is exactly what happens in LLMs latent space too before they start outputting the first token.
That's why saying "it's just predicting the next word", is a misguided take.
I don't think you do a random walk through the words of a sentence as you conceive it. But it is hard not to think people don't center themes and moods in their mind as they compose their thoughts into sentences.
Similarly, have you ever looked into how actors learn their lines? It is often in a way that is a lot closer to a diffusion than token at a time.
I can confess to not always knowing where I'll end up when I start talking. Similarly, not every time I open my mouth it's just to start but sometimes I do have a goal and conclusion.
Then, glorifies wrestling in said tarpit: how do people actually compose sentences? Is an LLM thinking or writing? Can you look into how actors memorize lines before responding?
Error beyond the tarpit is, these are all ineffable questions that assume a singular answer to an underspecified question across many bags of sentient meat.
Taking a step back to the start, we're wondering:
Do LLMs plan for token N + X, while purely working to output token N?
TL;DR: yes.
via https://www.anthropic.com/research/tracing-thoughts-language....
Clear quick example they have is, ask it to write a poem, get state at end of line 1, scramble the feature that looks ahead to end of line 2's rhyme.
I did it multiple times while writing this comment, and it is only four sentences. The previous sentence once said "two sentences," and after I added this statement it was changed to "four sentences."
This can be captured by generating reasoning tokens (outputting some representation the desired conclusion in token form, then using it as context for the actual tokens), or even by an intermediate layer of a model not using reasoning.
If a certain set of nodes are strong contributors to generate the concluding sentence, and they remain strong throughout all generated tokens, who's to say if those nodes weren't capturing a latent representation of the "crux" of the answer before any tokens were generated?
(This is also in the context of the LLM being able to use long-range attention to not need to encode in full detail what it "wants to say" - just the parts of the original input text that it is focusing on over time.)
Of course, this doesn't mean that this is the optimal way to build coherent and well-reasoned answers, nor have we found an architecture that allows us to reliably understand what is going on! But the mechanics for what you describe certainly can arise in non-diffusion LLM architectures.
Which is not to say that it's wrong or a bad approach. And I get why people are feeling a connection to the "diffusive" style. But, at the end of the day, all of these methods do build as their ultimate goal a coherent sequence of words that follow one after the other. It's just a difference of how much insight you have into the process.
Words rise from an abyss and are served to you, you have zero insight into their formation. If I tell you to think of an animal, one just appears in your "context", how it got there is unknown.
So really there is no argument to be made, because we still don't mechanistically understand how the brain works.
It's statements like these that make me wonder if I am the same species as everyone else. Quite often, I've picked adjectives and idioms first, and then fill in around them to form sentences. Often because there is some pun or wordplay, or just something that has a nice ring to it, and I want to lead my words in that direction. If you're only choosing them one at a time and sequentially, have you ever considered that you might just be a dimwit?
It's not like you don't see this happening all around you in others. Sure you can't read minds, but have you never once watched someone copyedit something they've written, where they move phrases and sentences around, where they switch out words for synonyms, and so on? There are at least dozens of fictional scenes in popular media, you must have seen one. You have to have noticed hints at some point in your life that this occurs. Please. Just tell me that you spoke hastily to score internet argument points, and that you don't believe this thing you've said.
In order to model poetry autoregressively, you're going to need a variable that captures rhyme scheme. At the point where you've ended the first line, the model needs to keep track of the rhyme that was used, just like it does for something like coreference resolution.
I don't think that the mentioned paper shows that the model engages in a preplanning phase in which it plans the rhyme that will come. In fact such would be impossible. Model state is present only in so-far-generated text. It is only after the model has found itself in a poetry generating context and has also selected the first line-ending word, that a rhyme scheme "emerges" as a variable. (Now yes, as you increase the posterior probability of 'being in a poem' given context so far, you would expect that you also increase the probability of the rhyme-scheme variable's existing.)
Might I trouble you for help getting from there to “such would be impossible”, where such is “the model…plans the rhyme to come”
Edit: I’m surprised to be at -2 for this. I am representing the contents of the post accurately. Its unintuitive for sure, but, it’s the case.
When I read that text, something like this happens:
Visual perception of text (V1, VWFA) → Linguistic comprehension (Angular & Temporal Language Areas) → Semantic activation (Temporal + Hippocampal Network) → Competitive attractor stabilization (Prefrontal & Cingulate) → Top-down visual reactivation (Occipital & Fusiform) → Conscious imagery (Prefrontal–Parietal–Thalamic Loop).
and you can find experts in each of those areas who understand the specifics a lot more.
The first person experience of having a thought, to me, feels like I have the whole thought in my head, and then I imagine expressing it to somebody one word at a time. But it really feels like I’m reading out the existing thought.
Then, if I’m thinking hard, I go around a bit and argue against the thought that was expressed in my head (either because it is not a perfect representation of the actual underlying thought, or maybe because it turns out that thought was incorrect once I expressed it sequentially).
At least that’s what I think thinking feels like. But, I am just a guy thinking about my brain. Surely philosophers of the mind or something have queried this stuff with more rigor.
You are undoubtedly technically correct, but I prefer the simplicity, purity and ease-of-use of the abysmal model, especially in comparison with other similar competing models, such as the below-discussed "tarpit" model.
Clearly communication is sequential.
LLMs are not more sequential than your vocal chords or your hand writing. They also plan ahead before writing.
But I can't believe the actual literal words are diffusing when you're thinking.
When being asked: "How are you today", there is no way that your thoughts are literally like "Alpha zulu banana" => "I banana coco" => "I banana good" => "I am good". The diffusion does not happen at the output token layer, it happens much earlier at a higher level of abstraction.
"I ____ ______ ______ ______ and _____ _____ ______ ____ the ____ _____ _____ _____."
If the images in the article are to be considered an accurate representation, the model is putting meaningless bits of connective tissue way before the actual ideas. Maybe it's not working like that. But the "token-at-a-time" model is also obviously not literally looking at only one word at a time either.
Wrong. There's "model state", (I assume you mean hidden layers), not just in the generated text, but also in the initial prompt given to the model. I.e. the model can start its planning from the moment it's given the instruction, without even having predicted a token yet. That's actually what they show in the paper above...
> It is only after the model has found itself in a poetry generating context and has also selected the first line-ending word, that a rhyme scheme "emerges" as a variable
This is an assertion based on flawed reasoning.
(Also, these ideas should really be backed up by evidence and experimentation before asserting them so definitively.)
What happens in the black box of the human mind to determine the next word to write/say is exactly made irrelevant in this level of abstraction, as regardless how, it would still result in a linear sequence of actions as observed by the environment.
(n.b. been here 16 years and this is such a classic downvote scenario the past two years. people overindexing on big words that are familiar to them, and on any sort of challenging tone. That's almost certainly why I got mine, I was the dummy who read the article and couldn't grasp the stats nonsense, and "could I bother you to help" or w/e BS I said, well, was BS)