454 points nathan-barry | 82 comments
1. kibwen ◴[] No.45645307[source]
To me, the diffusion-based approach "feels" more akin to what's going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words; I start by having some fuzzy idea in my head and the challenge is in serializing it into language coherently.
replies(14): >>45645350 #>>45645383 #>>45645401 #>>45645402 #>>45645509 #>>45645523 #>>45645607 #>>45645665 #>>45645670 #>>45645891 #>>45645973 #>>45647491 #>>45648578 #>>45652892 #
2. cube2222 ◴[] No.45645350[source]
I will very often write a message on slack, only to then edit it 5 times… Now I always feel like a diffusion model when I do that.
replies(1): >>45647460 #
3. aabhay ◴[] No.45645383[source]
The fact that you’re cognitively aware is evidence that this is nowhere near diffusion. More like rumination or thinking tokens, if we absolutely had to find a present day LLM metaphor
4. crubier ◴[] No.45645401[source]
You 100% do pronounce or write words one at a time sequentially.

But before starting your sentence, you internally formulate the gist of the sentence you're going to say.

Which is exactly what happens in LLMs' latent space too, before they start outputting the first token.

replies(5): >>45645466 #>>45645546 #>>45645695 #>>45645968 #>>45646205 #
5. HPsquared ◴[] No.45645402[source]
Maybe it's two different modes of thinking. I can have thoughts that coalesce from the ether, but also sometimes string a thought together linearly. Brains might be able to do both.
6. froobius ◴[] No.45645466[source]
(Just to expand on that, it's true not just for the first token. There's a lot of computation, including potentially planning ahead, before each token is outputted.)

That's why saying "it's just predicting the next word" is a misguided take.

7. EGreg ◴[] No.45645509[source]
I feel completely the opposite way.

When you speak or do anything, you focus on what you’re going to do next. Your next action. And at that moment you are relying on your recent memory, and things you have put in place while doing the overall activity (context).

In fact what’s actually missing from AI currently is simultaneous collaboration, like a group of people interacting — it is very 1 on 1 for now. Like human conversations.

Diffusion is like looking at a cloud and trying to find a pattern.

replies(1): >>45646015 #
8. ma2rten ◴[] No.45645523[source]
Interpretability research has found that Autoregressive LLMs also plan ahead what they are going to say.
replies(2): >>45645712 #>>45646027 #
9. taeric ◴[] No.45645546[source]
I'm curious what makes you so confident on this? I confess I expect that people are often far more cognizant of the last thing that they want to say when they start?

I don't think you do a random walk through the words of a sentence as you conceive it. But it is hard not to think that people center themes and moods in their mind as they compose their thoughts into sentences.

Similarly, have you ever looked into how actors learn their lines? It is often in a way that is a lot closer to a diffusion than token at a time.

replies(7): >>45645580 #>>45645621 #>>45646119 #>>45646153 #>>45646165 #>>45647044 #>>45647828 #
10. CaptainOfCoit ◴[] No.45645580{3}[source]
I think there is a wide range of ways to "turn something in the head into words", and sometimes you use the "this is the final point, work towards it" approach and sometimes you use the "not sure what will happen, let's just start talking and go wherever" approach. Different approaches have different tradeoffs, and of course different people have different defaults.

I can confess to not always knowing where I'll end up when I start talking. Similarly, it's not every time I open my mouth that I'm just starting to talk; sometimes I do have a goal and a conclusion.

11. silveraxe93 ◴[] No.45645607[source]
That's why I'm very excited by Gemini diffusion[1].

- [1] https://deepmind.google/models/gemini-diffusion/

12. refulgentis ◴[] No.45645621{3}[source]
It's just too far of an analogy, it starts in the familiar SWE tarpit of human brain = lim(n matmuls) as n => infinity.

Then, glorifies wrestling in said tarpit: how do people actually compose sentences? Is an LLM thinking or writing? Can you look into how actors memorize lines before responding?

Error beyond the tarpit is, these are all ineffable questions that assume a singular answer to an underspecified question across many bags of sentient meat.

Taking a step back to the start, we're wondering:

Do LLMs plan for token N + X, while purely working to output token N?

TL;DR: yes.

via https://www.anthropic.com/research/tracing-thoughts-language....

Clear quick example they have is, ask it to write a poem, get state at end of line 1, scramble the feature that looks ahead to end of line 2's rhyme.

replies(1): >>45646235 #
13. dudu24 ◴[] No.45645665[source]
That is not contrary to token-at-a-time approach.
14. tripplyons ◴[] No.45645670[source]
Here's a blog post I liked that explains a connection: https://sander.ai/2024/09/02/spectral-autoregression.html

They call diffusion a form of "spectral autoregression", because it tends to first predict lower frequency features, and later predict higher frequency features.

15. smokel ◴[] No.45645695[source]
For most serious texts I start with a tree outline, before I engage my literary skills.
16. aidenn0 ◴[] No.45645712[source]
This seems likely just from the simple fact that they can reliably generate contextually correct sentences in e.g. German Imperfekt.
replies(3): >>45651812 #>>45651822 #>>45653730 #
17. ◴[] No.45645891[source]
18. pessimizer ◴[] No.45645968[source]
Like most people I jump back and forth when I speak, disclaiming, correcting, and appending to previous utterances. I do this even more when I write, eradicating entire sentences and even the ideas they contain, within paragraphs where, by the time they were finished, the sentence seemed unnecessary or inconsistent.

I did it multiple times while writing this comment, and it is only four sentences. The previous sentence once said "two sentences," and after I added this statement it was changed to "four sentences."

19. sailingparrot ◴[] No.45645973[source]
> the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words

Autoregressive LLMs don't do that either, actually. Sure, with one forward pass you only get one token at a time, but looking at what is happening in the latent space, there are clear signs of long-term planning and reasoning that go beyond just the next token.

So I don't think it's necessarily more or less similar to us than diffusion, we do say one word at a time sequentially, even if we have the bigger picture in mind.
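
For concreteness, here is a minimal sketch of the "one token per forward pass" loop (assuming the torch and transformers packages and the public gpt2 checkpoint; any causal LM would do). Each pass computes full hidden states at every position, which is where the longer-horizon planning can live, even though only one token is emitted:

    # Minimal greedy autoregressive decoding sketch: one forward pass per emitted token.
    # Assumes `torch` and `transformers` are installed and the public "gpt2" checkpoint.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

    with torch.no_grad():
        for _ in range(10):
            logits = model(input_ids).logits           # hidden states/logits for every position
            next_id = logits[:, -1, :].argmax(dim=-1)  # but only one new token is emitted
            input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

    print(tokenizer.decode(input_ids[0]))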

replies(5): >>45646422 #>>45650316 #>>45654585 #>>45656793 #>>45663541 #
20. thamer ◴[] No.45646027[source]
The March 2025 blog post by Anthropic titled "Tracing the thoughts of a large language model"[1] is a great introduction to this research, showing how their language model activates features representing concepts that will eventually get connected at some later point as the output tokens are produced.

The associated paper[2] goes into a lot more detail, and includes interactive features that help illustrate how the model "thinks" ahead of time.

[1] https://www.anthropic.com/research/tracing-thoughts-language...

[2] https://transformer-circuits.pub/2025/attribution-graphs/bio...

21. btown ◴[] No.45646119{3}[source]
> far more cognizant of the last thing that the they want to say when they start

This can be captured by generating reasoning tokens (outputting some representation of the desired conclusion in token form, then using it as context for the actual tokens), or even by an intermediate layer of a model not using reasoning.

If a certain set of nodes are strong contributors to generate the concluding sentence, and they remain strong throughout all generated tokens, who's to say if those nodes weren't capturing a latent representation of the "crux" of the answer before any tokens were generated?

(This is also in the context of the LLM being able to use long-range attention to not need to encode in full detail what it "wants to say" - just the parts of the original input text that it is focusing on over time.)

Of course, this doesn't mean that this is the optimal way to build coherent and well-reasoned answers, nor have we found an architecture that allows us to reliably understand what is going on! But the mechanics for what you describe certainly can arise in non-diffusion LLM architectures.

22. jrowen ◴[] No.45646153{3}[source]
They're speaking literally. When talking to someone (or writing), you ultimately say the words in order (edits or corrections notwithstanding). If you look at the gifs of how the text is generated - I don't know of anyone that has ever written like that. Literally writing disconnected individual words of the actual draft ("during," "and," "the") in the middle of a sentence and then coming back and filling in the rest. Even speaking like that would be incredibly difficult.

Which is not to say that it's wrong or a bad approach. And I get why people are feeling a connection to the "diffusive" style. But, at the end of the day, all of these methods do build as their ultimate goal a coherent sequence of words that follow one after the other. It's just a difference of how much insight you have into the process.

replies(1): >>45647690 #
23. Workaccount2 ◴[] No.45646165{3}[source]
People don't come up with things; their brain does.

Words rise from an abyss and are served to you, you have zero insight into their formation. If I tell you to think of an animal, one just appears in your "context", how it got there is unknown.

So really there is no argument to be made, because we still don't mechanistically understand how the brain works.

replies(1): >>45646871 #
24. NoMoreNicksLeft ◴[] No.45646205[source]
>You 100% do pronounce or write words one at a time sequentially.

It's statements like these that make me wonder if I am the same species as everyone else. Quite often, I've picked adjectives and idioms first, and then filled in around them to form sentences. Often because there is some pun or wordplay, or just something that has a nice ring to it, and I want to lead my words in that direction. If you're only choosing them one at a time and sequentially, have you ever considered that you might just be a dimwit?

It's not like you don't see this happening all around you in others. Sure you can't read minds, but have you never once watched someone copyedit something they've written, where they move phrases and sentences around, where they switch out words for synonyms, and so on? There are at least dozens of fictional scenes in popular media, you must have seen one. You have to have noticed hints at some point in your life that this occurs. Please. Just tell me that you spoke hastily to score internet argument points, and that you don't believe this thing you've said.

replies(2): >>45647651 #>>45652847 #
25. jsrozner ◴[] No.45646235{4}[source]
Let's just not call it planning.

In order to model poetry autoregressively, you're going to need a variable that captures rhyme scheme. At the point where you've ended the first line, the model needs to keep track of the rhyme that was used, just like it does for something like coreference resolution.

I don't think that the mentioned paper shows that the model engages in a preplanning phase in which it plans the rhyme that will come. In fact such would be impossible. Model state is present only in so-far-generated text. It is only after the model has found itself in a poetry generating context and has also selected the first line-ending word, that a rhyme scheme "emerges" as a variable. (Now yes, as you increase the posterior probability of 'being in a poem' given context so far, you would expect that you also increase the probability of the rhyme-scheme variable's existing.)

replies(2): >>45646312 #>>45648785 #
26. refulgentis ◴[] No.45646312{5}[source]
I’m confused: the blog shows they A) predict the end of line 2 using the state at the end of line 1 and B) can choose the end of line 2 by altering state at end of line 1.

Might I trouble you for help getting from there to “such would be impossible”, where such is “the model…plans the rhyme to come”?

Edit: I’m surprised to be at -2 for this. I am representing the contents of the post accurately. It’s unintuitive for sure, but it’s the case.

replies(1): >>45655442 #
27. wizzwizz4 ◴[] No.45646422[source]
If a process is necessary for performing a task, (sufficiently-large) neural networks trained on that task will approximate that process. That doesn't mean they're doing it anything resembling efficiently, or that a different architecture / algorithm wouldn't produce a better result.
replies(2): >>45646920 #>>45647495 #
28. aeonik ◴[] No.45646871{4}[source]
We don't know exactly how consciousness works in the human brain, but we know way more than "comes from the abyss".

When I read that text, something like this happens:

Visual perception of text (V1, VWFA) → Linguistic comprehension (Angular & Temporal Language Areas) → Semantic activation (Temporal + Hippocampal Network) → Competitive attractor stabilization (Prefrontal & Cingulate) → Top-down visual reactivation (Occipital & Fusiform) → Conscious imagery (Prefrontal–Parietal–Thalamic Loop).

and you can find experts in each of those areas who understand the specifics a lot more.

replies(1): >>45647195 #
29. jama211 ◴[] No.45646920{3}[source]
It also doesn’t mean they’re doing it inefficiently.
replies(1): >>45647093 #
30. bee_rider ◴[] No.45647044{3}[source]
It must be the case that some smart people have studied how we think, right?

The first person experience of having a thought, to me, feels like I have the whole thought in my head, and then I imagine expressing it to somebody one word at a time. But it really feels like I’m reading out the existing thought.

Then, if I’m thinking hard, I go around a bit and argue against the thought that was expressed in my head (either because it is not a perfect representation of the actual underlying thought, or maybe because it turns out that thought was incorrect once I expressed it sequentially).

At least that’s what I think thinking feels like. But, I am just a guy thinking about my brain. Surely philosophers of the mind or something have queried this stuff with more rigor.

31. pinkmuffinere ◴[] No.45647093{4}[source]
I read this to mean “just because the process doesn’t match the problem, that doesn’t mean it’s inefficient”. But I think it does mean that. I expect we intuitively know that data structures which match the structure of a problem are more efficient than those that don’t. I think the same thing applies here.

I realize my argument is hand-wavy, I haven’t defined “efficient” (in space? Time? Energy?), and there are other shortcomings, but I feel this is “good enough” to be convincing.

replies(2): >>45647687 #>>45658371 #
32. giardini ◴[] No.45647195{5}[source]
aeonik says >"We don't know exactly how consciousness works in the human brain, but we know way more than "comes from the abyss"."<

You are undoubtedly technically correct, but I prefer the simplicity, purity and ease-of-use of the abysmal model, especially in comparison with other similar competing models, such as the below-discussed "tarpit" model.

33. djmips ◴[] No.45647460[source]
Coding feels like that to me as well.
34. flux3125 ◴[] No.45647491[source]
It feels like a mix of both to me, diffusion "chunks" being generated in sequence. As I write this comment, I'm deciding on the next word while also shaping the next sentence, like turning a fuzzy idea into a clear sequence.
35. sailingparrot ◴[] No.45647495{3}[source]
I’m not arguing about efficiency though? Simply saying next-token predictors cannot be thought of as actually just thinking about the next token with no long-term plan.
replies(1): >>45648362 #
36. crubier ◴[] No.45647651{3}[source]
Are you able to pronounce multiple words in superposition at the same time? Are you able to write multiple words in superposition? Can you read the following sentence: "HWeolrllod!"

Clearly communication is sequential.

LLMs are not more sequential than your vocal cords or your handwriting. They also plan ahead before writing.

37. wizzwizz4 ◴[] No.45647687{5}[source]
Example: a list of (key, value) pairs is a perfectly valid way to implement a map, and suffices. However, a more complicated tree structure, perhaps with hashed keys, is usually way more efficient, which is increasingly-noticeable as the number of pairs stored in the map grows large.
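
A toy Python illustration of the same point (container contents made up): both structures answer the same lookups, but the pair list pays a linear scan per lookup while the hash-backed dict does not.

    # Toy comparison: both containers implement the same "map", with different lookup cost.
    pairs = [("a", 1), ("b", 2), ("c", 3)]   # list of (key, value) pairs: O(n) lookup
    hashed = dict(pairs)                     # hash-table-backed map: ~O(1) lookup

    def lookup_pairs(key):
        for k, v in pairs:                   # linear scan over every pair
            if k == key:
                return v
        raise KeyError(key)

    assert lookup_pairs("b") == hashed["b"] == 2   # same answer, very different scaling
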
38. tekne ◴[] No.45647690{4}[source]
Weird anecdote, but one of the reasons I have always struggled with writing is precisely that my process seems highly nonlinear. I start with a disjoint mind map of ideas I want to get out, often just single words, and need to somehow cohere that into text, which often happens out-of-order. The original notes are often completely unordered diffusion-like scrawling, the difference being I have less idea what the final positions of the words were going to be when I wrote them.
replies(1): >>45648043 #
39. ◴[] No.45647828{3}[source]
40. crubier ◴[] No.45648043{5}[source]
I can believe that your abstract thoughts in latent space are diffusing/forming progressively when you are thinking.

But I can't believe the actual literal words are diffusing when you're thinking.

When being asked: "How are you today", there is no way that your thoughts are literally like "Alpha zulu banana" => "I banana coco" => "I banana good" => "I am good". The diffusion does not happen at the output token layer, it happens much earlier at a higher level of abstraction.

replies(1): >>45648242 #
41. jrowen ◴[] No.45648242{6}[source]
Or like this:

"I ____ ______ ______ ______ and _____ _____ ______ ____ the ____ _____ _____ _____."

If the images in the article are to be considered an accurate representation, the model is putting meaningless bits of connective tissue way before the actual ideas. Maybe it's not working like that. But the "token-at-a-time" model is also obviously not literally looking at only one word at a time either.

42. wizzwizz4 ◴[] No.45648362{4}[source]
They rebuild the "long term plan" anew for every token: there's no guarantee that the reconstructed plan will remain similar between tokens. That's not how planning normally works. (You can find something like this every time there's this kind of gross inefficiency, which is why I gave the general principle.)
replies(3): >>45648898 #>>45648950 #>>45651754 #
43. naasking ◴[] No.45648578[source]
> Speaking for myself, I don't generate words one at a time based on previously spoken words

This is a common but fundamentally weird assumption people have about neurology, where they think that what they consciously perceive has some bearing on what's actually happening at the operational or physical level.

44. froobius ◴[] No.45648785{5}[source]
> Model state is present only in so-far-generated text

Wrong. There's "model state" (I assume you mean hidden layers) not just in the generated text, but also in the initial prompt given to the model. I.e. the model can start its planning from the moment it's given the instruction, without even having predicted a token yet. That's actually what they show in the paper above...

> It is only after the model has found itself in a poetry generating context and has also selected the first line-ending word, that a rhyme scheme "emerges" as a variable

This is an assertion based on flawed reasoning.

(Also, these ideas should really be backed up by evidence and experimentation before asserting them so definitively.)

45. sailingparrot ◴[] No.45648898{5}[source]
> They rebuild the "long term plan" anew for every token

Well no, there is attention in the LLM which allows it to look back at its "internal thought" during the previous tokens.

Token T at layer L can attend to a projection of the hidden states of all tokens < T at L. So it's definitely not starting anew at every token and is able to iterate on an existing plan.

It's not a perfect mechanism for sure, and there is work to make LLMs able to carry more information forward (e.g. feedback transformers), but they can definitely do some of that today.
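
A rough single-head sketch of that mechanism in PyTorch (sizes and weights made up; real blocks add multi-head splits, residuals, etc.): the causal mask is what lets each position read the states of earlier positions without looking ahead.

    # Rough single-head causal self-attention sketch (illustrative only).
    import torch
    import torch.nn.functional as F

    T, d = 5, 16                                  # sequence length, hidden size
    h = torch.randn(T, d)                         # hidden states at some layer L
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))
    q, k, v = h @ wq, h @ wk, h @ wv

    scores = q @ k.T / d ** 0.5                   # (T, T) attention scores
    causal = torch.tril(torch.ones(T, T)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))  # position t only sees positions <= t

    out = F.softmax(scores, dim=-1) @ v           # each row mixes information from its past
    print(out.shape)                              # torch.Size([5, 16])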

replies(1): >>45649444 #
46. HarHarVeryFunny ◴[] No.45648950{5}[source]
Actually, due to using causal (masked) attention, new tokens appended to the input don't have any effect on what's calculated internally (the "plan") at earlier positions in the input, and a modern LLM therefore uses a KV cache rather than recalculating at those earlier positions.

In other words, the "recalculated" plan will be exactly the same as before, just extended with new planning at the position of each newly appended token.
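
A quick numeric way to see this with a toy single-head attention (purely illustrative, made-up weights): under a causal mask, the outputs at the prefix positions are unchanged when extra tokens are appended, which is exactly what makes caching their keys and values valid.

    # Toy check: causal attention outputs at earlier positions don't change when
    # new tokens are appended -- the property a KV cache relies on.
    import torch
    import torch.nn.functional as F

    def causal_attn(h, wq, wk, wv):
        q, k, v = h @ wq, h @ wk, h @ wv
        scores = q @ k.T / h.shape[-1] ** 0.5
        mask = torch.tril(torch.ones(len(h), len(h))).bool()
        return F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

    torch.manual_seed(0)
    d = 16
    wq, wk, wv = (torch.randn(d, d) for _ in range(3))
    prefix = torch.randn(6, d)                          # 6 "tokens" worth of hidden states
    extended = torch.cat([prefix, torch.randn(3, d)])   # 3 new tokens appended

    out_prefix = causal_attn(prefix, wq, wk, wv)
    out_extended = causal_attn(extended, wq, wk, wv)
    print(torch.allclose(out_prefix, out_extended[:6])) # True: prefix positions unchanged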

replies(1): >>45649010 #
47. astrange ◴[] No.45649010{6}[source]
You can violate the plan in the sampler by making an "unreasonable" choice of next token to sample (eg by raising the temperature.) So if it does stick to the same plan, it's not going to be a very good one.
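
For concreteness, temperature is just a rescaling of the logits before the softmax; a tiny sketch with made-up logits:

    # Temperature scaling: higher temperature flattens the next-token distribution,
    # making "unreasonable" choices much more likely to be sampled.
    import torch
    import torch.nn.functional as F

    logits = torch.tensor([4.0, 2.0, -1.0])       # made-up next-token logits
    for temperature in (0.5, 1.0, 2.0):
        probs = F.softmax(logits / temperature, dim=-1)
        print(temperature, [round(p, 3) for p in probs.tolist()])
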
replies(1): >>45649359 #
48. HarHarVeryFunny ◴[] No.45649359{7}[source]
Yeah.

Karpathy recently referred to LLMs having more "working memory" than a human, apparently referring to these unchanging internal activations as "memory", but it's an odd sort of "working memory" if you can't actually update it to reflect progress on what you are working on, or update per new information (new unexpected token having been sampled).

replies(1): >>45649723 #
49. wizzwizz4 ◴[] No.45649444{6}[source]
This isn't the same as planning. Consider what happens when tokens from another source are appended.
replies(1): >>45649590 #
50. sailingparrot ◴[] No.45649590{7}[source]
I don't follow how this relates to what we are discussing. Autoregressive LLMs are able to plan within a single forward pass and are able to look back at their previous reasoning and do not start anew at each token like you said.

If you append tokens from another source, like in a turn-based conversation, then the LLM will process all the new appended tokens in parallel while still being able to look back at its previous internal state (and thus past reasoning/planning in latent space) from the already processed tokens, then will adjust the plan based on the new information.

What happens to you as a human if you come up with a plan with limited information and new information is provided to you?

replies(1): >>45651189 #
51. sailingparrot ◴[] No.45649723{8}[source]
I think a better mental framework for how those models work is that they keep a history of the state of their "memory" across time.

Where humans have a single evolving state of memory, LLMs have access to all the states of their "memories" across time, and while past states can't be changed, the new state can: this is the current token's hidden state, and to form this new state they look both at the history of previous states and at the new information (the last token having been sampled, or an external token from RAG or whatnot appended to the context).

This is how progress is stored.

replies(1): >>45650077 #
52. HarHarVeryFunny ◴[] No.45650077{9}[source]
Thanks, that's a useful way to think about it.

Presumably the internal state at any given token position must also be encoding information specific to that position, as well as this evolving/current memory... So, can this be seen in the internal embeddings - are they composed of a position-dependent part that changes a lot between positions, and an evolving memory part that is largely similar between positions only changing slowly?

Are there any papers or talks discussing this ?

replies(1): >>45650379 #
53. bjourne ◴[] No.45650316[source]
That is precisely what autoregressive means. Perhaps you meant to write that modern LLMs are not strictly autoregressive?
replies(1): >>45650411 #
54. sailingparrot ◴[] No.45650379{10}[source]
I don't remember any paper looking at this specific question (though it might be out there), but in general Anthropic's circuits thread series of articles is very good on the broader subject: https://transformer-circuits.pub
55. janalsncm ◴[] No.45650411{3}[source]
I think they are distinguishing the mechanical process of generation from the way the idea exists. It’s the same as how a person can literally only speak one word at a time but the ideas might be nonlinear.
replies(2): >>45650595 #>>45659970 #
56. sailingparrot ◴[] No.45650595{4}[source]
Indeed what I meant. The LLM isn’t a blank slate at the beginning of each new token during autoregression as the kv cache is there.
57. LelouBil ◴[] No.45651189{8}[source]
Not the original person you are replying to, but I wanted to add:

Yes, they can plan within a single forward pass like you said, but I still think they "start anew at each token" because they have no state/memory that is not the output.

I guess this is differing interpretations of the meaning of "start anew", but personally I would agree that having no internal state and simply looking back at its previous output to form a new token is "starting anew".

But I'm also not well informed about the topic so happy to be corrected.

replies(2): >>45651765 #>>45652292 #
58. nl ◴[] No.45651754{5}[source]
Right, and this is what "reasoning LLMs" work around by having explicitly labelled "reasoning tokens".

This lets them "save" the plan between tokens, so when regenerating the new token it is following the plan.

59. nl ◴[] No.45651765{9}[source]
Worth noting here for others following that a single forward pass is what generates a single token.

It's correct to state that the LLM starts anew for each token.

The workaround for this is to pass the existing plan back into it as part of the context.

replies(1): >>45652333 #
60. ma2rten ◴[] No.45651812{3}[source]
It's actually true on many levels, if you think about what is needed for generating syntactically and grammatically correct sentences, coherent text, and working code.
replies(1): >>45658031 #
61. treis ◴[] No.45651822{3}[source]
I don't think you're wrong but I don't think your logic holds up here. If you have a literal translation like:

I have a hot dog _____

The word in the blank is not necessarily determined when the sentence is started. Several verbs fit at the end and the LLM doesn't need to know which it's going to pick when it starts. Each word narrows down the possibilities:

I - Trillions
Have - Billions
a - millions
hot - thousands
dog - dozens
_____ - Could be eaten, cooked, thrown, whatever.

If it chooses "cooked" at this point, that doesn't necessarily mean that the LLM was going to do that when it chose "I" or "have".

replies(1): >>45652378 #
62. sailingparrot ◴[] No.45652292{9}[source]
But you are missing the causal attention from your analysis. The output is not the only thing that is preserved, there is also the KV-cache.

At token 1, the model goes through, say, 28 transformer blocks, and for each one of those blocks we save 2 projections of the hidden state in a cache.

At token 2, on top of seeing the new token, the model is now also able, in each one of those 28 blocks, to look at the previously saved hidden states from token 1.

At token 3, it can see the states from token 2 and 1 etc.

However, I still agree it is not a perfect information-passing mechanism because of how those models are trained (and something like the feedback transformer would be better), but information is still very much being passed from earlier tokens to later ones.
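
As a data structure, the cache is nothing exotic; a toy sketch (all sizes made up; real models store one key/value tensor pair per layer, each growing by one position per generated token):

    # Toy KV cache: one (keys, values) pair per layer, appended to at every decode step.
    import torch

    num_layers, num_heads, head_dim = 4, 2, 8     # made-up sizes for illustration
    cache = [{"k": torch.empty(num_heads, 0, head_dim),
              "v": torch.empty(num_heads, 0, head_dim)} for _ in range(num_layers)]

    def append_step(cache, new_k, new_v):
        # new_k/new_v: per-layer key/value projections of the current token's hidden state
        for layer in range(len(cache)):
            cache[layer]["k"] = torch.cat([cache[layer]["k"], new_k[layer]], dim=1)
            cache[layer]["v"] = torch.cat([cache[layer]["v"], new_v[layer]], dim=1)

    for _ in range(3):                            # "generate" three tokens
        new_k = [torch.randn(num_heads, 1, head_dim) for _ in range(num_layers)]
        new_v = [torch.randn(num_heads, 1, head_dim) for _ in range(num_layers)]
        append_step(cache, new_k, new_v)

    print(cache[0]["k"].shape)                    # torch.Size([2, 3, 8]): 3 cached positions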

replies(1): >>45655087 #
63. sailingparrot ◴[] No.45652333{10}[source]
You are forgetting about attention on the kv-cache, which is the mechanism that allows the LLM to not start anew every time.
replies(1): >>45652519 #
64. aidenn0 ◴[] No.45652378{4}[source]
That's why I hedged with "seems likely" and added "in context." If this is in the middle of a paragraph, then there are many fewer options to fit in the blank from the very start.
65. nl ◴[] No.45652519{11}[source]
I mean sure, but that is a performance optimization that doesn't really change what is going on.

It's still recalculating, just that intermediate steps are cached.

replies(1): >>45654589 #
66. stevenhuang ◴[] No.45652847{3}[source]
All of that can still be seen as a linear sequence of actions from the perspective of human I/O with the environment.

What happens in the black box of the human mind to determine the next word to write/say is exactly made irrelevant in this level of abstraction, as regardless how, it would still result in a linear sequence of actions as observed by the environment.

67. wonnage ◴[] No.45652892[source]
LLMs are notoriously bad at reflecting on how they work and I feel like humans are probably in the same boat
68. rcxdude ◴[] No.45653730{3}[source]
And, to pick an example from the research, being able to generate output that rhymes. In fact, it's hard to see how you would produce anything that would be considered coherent text without some degree of planning ahead at some level of abstraction. If it was truly one token at a time without any regard for what comes next it would constantly 'paint itself into a corner' and be forced to produce nonsense (which, it seems, does still happen sometimes, but without any planning it would occur constantly).
69. getnormality ◴[] No.45654585[source]
You're right that there is long-term planning going on, but that doesn't contradict the fact that an autoregressive LLM does, in fact, literally generate words one at a time based on previously spoken words. Planning and action are different things.
70. sailingparrot ◴[] No.45654589{12}[source]
Isn't the ability to store past reasoning in an external system to avoid having to do the computation all over again precisely what a memory is though?

But mathematically, KV-caching is equivalent to doing prefilling at every token, sure. The important part of my message, though, was the attention.

A plan/reasoning made during the forward pass of token 0 can be looked at by subsequent (or parallel if you don’t want to use the cache) passes of token 1,…,n. So you cannot consider token n to be starting from scratch in terms of reasoning/planning as it can reuse what has already been planned in previous tokens.

If you think about inference with KV-caching, even though you are right that mathematically it's just an optimization, it makes this behavior much easier to reason about: the kv-cache is a store of past internal states that the model can attend to for subsequent tokens, which allows those subsequent tokens' internal hidden states to be more than just a repetition of what the model already reasoned about in the past.

71. LelouBil ◴[] No.45655087{10}[source]
Like another commenter said, isn't the KV cache a performance optimization to not have to redo work that was already done? Or does it fundamentally alter the output of the LLM, and so preserve state that is not present in the output of the LLM?
replies(1): >>45655552 #
72. froobius ◴[] No.45655442{6}[source]
I agree, the post above you is patently wrong / hasn't read the paper they are dismissing. I also got multiple downvotes for disagreeing, with no actual rebuttal.
replies(1): >>45655646 #
73. sailingparrot ◴[] No.45655552{11}[source]
Yes, it's "just" an optimization technique, in the sense that you could not have it and end up with the same result (given the same input sequence), just much slower.

Conceptually what matters is not the kv-cache but the attention. But IMHO, thinking about how the model behaves during inference, when outputting one token at a time and doing attention on the kv cache, is much easier to grok than thinking about training/prefilling, where the kv cache is absent and everything happens in parallel (although they are mathematically equivalent).

The important part of my point is that when the model is processing token N, it can check its past internal state during tokens 1,...,N-1, and thus "see" its previous plan and reasoning, and iterate over it, rather than just repeating everything from scratch in each token's hidden state (with a caveat, explained at the end).

    token_1 ──▶ h₁ᴸ
    token_2 ──▶ h₂ᴸ ──attn──▶ sees h₁ᴸ (refines reasoning)
    token_3 ──▶ h₃ᴸ ──attn──▶ sees h₁ᴸ, h₂ᴸ (refines further)

And the kv-cache makes this persistent across time, so the entire system (LLM+cache) becomes effectively able to save its state, and iterate upon it at each token, and not have to start from scratch every time.

But ultimately it's a Markov chain, so again mathematically, yes, you could just re-do the full computation all the time and end up in the same place.

Caveat: because token N at layer L can attend to all other tokens < N but only at layer L, it only allows it to see how the reasoning was at that depth, not how it was after a full pass, so it's not a perfect information-passing mechanism and is more pyramidal than a straight line. Hence why I referenced feedback transformers in another message. But the principle still applies that information passes through time steps.

74. refulgentis ◴[] No.45655646{7}[source]
You're my fav new-ish account, spent about 5 minutes Googling froobius yesterday tryna find more content. :) Concise, clear, no BS takes for high-minded nonsense that sounds technical. HNs such a hellhole for LLM stuff, the people who are hacking ain't here, and the people who are, well, they mostly like yapping about how it connects to some unrelated grand idea they misremember from undergrad. Cheers.

(n.b. been here 16 years and this is such a classic downvote scenario the past two years. people overindexing on big words that are familiar to them, and on any sort of challenging tone. That's almost certainly why I got mine, I was the dummy who read the article and couldn't grasp the stats nonsense, and "could I bother you to help" or w/e BS I said, well, was BS)

75. vbarrielle ◴[] No.45656793[source]
There is some long-term planning going on, but bad luck when sampling the next token can take the process off the rails, so it's not just an implementation detail.
76. aidenn0 ◴[] No.45658031{4}[source]
Just generating syntactically and grammatically correct sentences doesn't need much lookahead; prefixes to sentences that cannot be properly completed are going to be extremely unlikely to be generated.
77. jama211 ◴[] No.45658371{5}[source]
I suppose there’s something in what you’re saying, it’s just that it’s sorta vague and hard to parse for me. It also depends on the higher-order problem space, for example: is it efficient if the problem is defined by “make something that can adapt to a problem space and solve it without manual engineering” rather than “make something with a long lead-up time where you understand the problem space in advance and therefore have time to optimise the engine”? In the former, the neural network would indeed count as solving this efficiently, because of the given definition of the goal.
78. bjourne ◴[] No.45659970{4}[source]
If so they are wrong. :) Autoregressive just means that the probability of the next token is just a function of the already seen/emitted tokens. Any "ideas that may exist" are entirely embedded in this sequence.
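
In symbols, that definition is just the factorization

    p(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})

where each conditional is computed by the model with parameters \theta; the already-emitted sequence fixes the conditioning, and the model supplies the distribution.
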
replies(1): >>45660565 #
79. sailingparrot ◴[] No.45660565{5}[source]
> entirely embedded in this sequence.

Obviously wrong, as otherwise every model would predict exactly the same thing; it would not even be predicting anymore, simply decoding.

The sequence is not enough to reproduce the exact output; you also need the weights.

And the way the model works is by attending to its own internal state (weights*input) and refining it, both across the depth (layer) dimension and across the time (token) dimension.

The fact that you can get the model to give you the exact same output by fixing a few seeds is only a consequence of the process being Markovian, and is orthogonal to the fact that at each token position the model is “thinking” about a longer horizon than the present token and is able to reuse that representation at later time steps.

replies(1): >>45662988 #
80. bjourne ◴[] No.45662988{6}[source]
Well, that an autoregressive model has parameters does not mean it is not autoregressive. LLMs are not Markovian.
replies(1): >>45664399 #
81. janalsncm ◴[] No.45663541[source]
To take a simple example, let’s say we ask an autoregressive model a yes/no factual question like “is 1+1=2?”. Then, we force the LLM to start with the wrong answer “No, “ and continue decoding.

An autoregressive model can’t edit the past. If it happens to sample the wrong first token (or we force it to in this case), there’s no going back. Of course there can be many more complicated lines of thinking as well where backtracking would be nice.

“Reasoning” LLMs tack this on with reasoning tokens. But the issue with this is that the LLM has to attend to every incorrect, irrelevant line of thinking which is at a minimum a waste and likely confusing.

As an analogy, in HN I don’t need to attend to every comment under a post in order to generate my next word. I probably just care about the current thread from my comment up to the OP. Of course a model could learn that relationship but that’s a huge waste of compute.

Text diffusion solves the whole problem entirely by allowing the model to simply revise the “no” to a “yes”. Very simple.
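
A purely illustrative toy of the structural difference (not a real model; the "target" stands in for whatever the model would prefer to say): an autoregressive decoder can only ever append after the forced "No,", while an iterative, diffusion-style refiner may overwrite any position on a later step.

    # Toy contrast: append-only decoding vs. iterative refinement that can revise position 0.
    TARGET = ["Yes", ",", "1+1", "equals", "2", "."]   # stand-in for the model's preferred answer

    # Autoregressive: the forced prefix is frozen; decoding only ever appends.
    autoreg = ["No", ","]
    while len(autoreg) < len(TARGET):
        autoreg.append(TARGET[len(autoreg)])           # next token chosen, prefix untouched
    print(" ".join(autoreg))                           # No , 1+1 equals 2 .  (stuck with "No")

    # Diffusion-style: start from masks, revisit every position on each refinement step,
    # so the early wrong token can be revised later.
    seq = ["<mask>"] * len(TARGET)
    seq[0] = "No"                                      # same forced wrong start
    for _ in range(2):                                 # each step re-predicts all positions
        seq = list(TARGET)
    print(" ".join(seq))                               # Yes , 1+1 equals 2 .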

82. sailingparrot ◴[] No.45664399{7}[source]
At no point have I argued that LLMs aren’t autoregressive, I am merely talking about LLMs ability to reason across time steps, so it seems we are talking past each other which won’t lead anywhere.

And yes, LLM can be studied under the lens of Markov processes: https://arxiv.org/pdf/2410.02724

Have a good day