454 points nathan-barry | 25 comments
kibwen ◴[] No.45645307[source]
To me, the diffusion-based approach "feels" more akin to what's going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words; I start by having some fuzzy idea in my head, and the challenge is in serializing it into language coherently.
replies(14): >>45645350 #>>45645383 #>>45645401 #>>45645402 #>>45645509 #>>45645523 #>>45645607 #>>45645665 #>>45645670 #>>45645891 #>>45645973 #>>45647491 #>>45648578 #>>45652892 #
sailingparrot ◴[] No.45645973[source]
> the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words

Autoregressive LLMs don't do that either, actually. Sure, with one forward pass you only get one token at a time, but looking at what is happening in the latent space, there are clear signs of long-term planning and reasoning that go beyond just the next token.

So I don't think it's necessarily more or less similar to us than diffusion; we do say one word at a time, sequentially, even if we have the bigger picture in mind.

replies(5): >>45646422 #>>45650316 #>>45654585 #>>45656793 #>>45663541 #
1. wizzwizz4 ◴[] No.45646422[source]
If a process is necessary for performing a task, (sufficiently large) neural networks trained on that task will approximate that process. That doesn't mean they're doing it with anything resembling efficiency, or that a different architecture / algorithm wouldn't produce a better result.
replies(2): >>45646920 #>>45647495 #
2. jama211 ◴[] No.45646920[source]
It also doesn’t mean they’re doing it inefficiently.
replies(1): >>45647093 #
3. pinkmuffinere ◴[] No.45647093[source]
I read this to mean “just because the process doesn’t match the problem, that doesn’t mean it’s inefficient”. But I think it does mean that. I expect we intuitively know that data structures which match the structure of a problem are more efficient than those that don’t. I think the same thing applies here.

I realize my argument is hand-wavy, I haven't defined "efficient" (in space? time? energy?), and there are other shortcomings, but I feel this is "good enough" to be convincing.

replies(2): >>45647687 #>>45658371 #
4. sailingparrot ◴[] No.45647495[source]
I'm not arguing about efficiency, though. I'm simply saying next-token predictors cannot be thought of as actually just thinking about the next token with no long-term plan.
replies(1): >>45648362 #
5. wizzwizz4 ◴[] No.45647687{3}[source]
Example: a list of (key, value) pairs is a perfectly valid way to implement a map, and suffices. However, a more complicated tree structure, perhaps with hashed keys, is usually way more efficient, which becomes increasingly noticeable as the number of pairs stored in the map grows large.
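
A toy sketch of that contrast, in Python with made-up data: both containers answer the same lookups, but the cost profile diverges as the map grows.

    # Both structures implement the same "map" task; only the lookup cost differs.
    pairs = [("a", 1), ("b", 2), ("c", 3)]    # list of (key, value) pairs

    def assoc_get(pairs, key):
        for k, v in pairs:                    # O(n) scan on every lookup
            if k == key:
                return v
        raise KeyError(key)

    hashed = dict(pairs)                      # hash table: ~O(1) expected lookup
    assert assoc_get(pairs, "b") == hashed["b"] == 2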
6. wizzwizz4 ◴[] No.45648362[source]
They rebuild the "long term plan" anew for every token: there's no guarantee that the reconstructed plan will remain similar between tokens. That's not how planning normally works. (You can find something like this every time there's this kind of gross inefficiency, which is why I gave the general principle.)
replies(3): >>45648898 #>>45648950 #>>45651754 #
7. sailingparrot ◴[] No.45648898{3}[source]
> They rebuild the "long term plan" anew for every token

Well, no: there is attention in the LLM, which allows it to look back at its "internal thoughts" from the previous tokens.

Token T at layer L can attend to a projection of the hidden states of all tokens < T at layer L. So it's definitely not starting anew at every token and is able to iterate on an existing plan.

It's not a perfect mechanism for sure, and there is work to make LLMs able to carry more information forward (e.g. feedback transformers), but they can definitely do some of that today.
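
To make that concrete, here is a minimal single-layer, single-head causal-attention sketch in NumPy (sequence length, hidden size, and weights are all invented for illustration): the output at position t is a mixture of projected hidden states from positions up to t only.

    import numpy as np

    T, d = 4, 8                                   # toy sequence length and hidden size
    rng = np.random.default_rng(0)
    h = rng.normal(size=(T, d))                   # hidden states entering the layer
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf                      # position t cannot attend to positions > t
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ v                                # out[t] mixes v[0..t] only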

replies(1): >>45649444 #
8. HarHarVeryFunny ◴[] No.45648950{3}[source]
Actually, due to using causal (masked) attention, new tokens appended to the input don't have any effect on what's calculated internally (the "plan") at earlier positions in the input, and a modern LLM therefore uses a KV cache rather than recalculating at those earlier positions.

In other words, the "recalculated" plan will be exactly the same as before, just extended with new planning at the position of each newly appended token.
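
A quick check of that claim with the same kind of toy setup (NumPy, invented sizes): under a causal mask, the activations at the first four positions are identical whether or not a fifth token has been appended, which is exactly why caching them is safe.

    import numpy as np

    def causal_attn(h, Wq, Wk, Wv):
        q, k, v = h @ Wq, h @ Wk, h @ Wv
        s = q @ k.T / np.sqrt(h.shape[1])
        s[np.triu(np.ones(s.shape, dtype=bool), k=1)] = -np.inf   # mask future positions
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        return (a / a.sum(axis=-1, keepdims=True)) @ v

    rng = np.random.default_rng(0)
    d = 8
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    prefix = rng.normal(size=(4, d))                         # 4-token prefix
    extended = np.vstack([prefix, rng.normal(size=(1, d))])  # same prefix + 1 new token

    out4 = causal_attn(prefix, Wq, Wk, Wv)
    out5 = causal_attn(extended, Wq, Wk, Wv)
    assert np.allclose(out4, out5[:4])                       # earlier positions are unchanged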

replies(1): >>45649010 #
9. astrange ◴[] No.45649010{4}[source]
You can violate the plan in the sampler by making an "unreasonable" choice of next token to sample (e.g. by raising the temperature). So if it does stick to the same plan, it's not going to be a very good one.
replies(1): >>45649359 #
10. HarHarVeryFunny ◴[] No.45649359{5}[source]
Yeah.

Karpathy recently referred to LLMs as having more "working memory" than a human, apparently referring to these unchanging internal activations as "memory", but it's an odd sort of "working memory" if you can't actually update it to reflect progress on what you are working on, or update it per new information (a new, unexpected token having been sampled).

replies(1): >>45649723 #
11. wizzwizz4 ◴[] No.45649444{4}[source]
This isn't the same as planning. Consider what happens when tokens from another source are appended.
replies(1): >>45649590 #
12. sailingparrot ◴[] No.45649590{5}[source]
I don't follow how this relates to what we are discussing. Autoregressive LLMs are able to plan within a single forward pass, and they are able to look back at their previous reasoning; they do not start anew at each token, as you claimed.

If you append tokens from another source, like in a turn-based conversation, then the LLM will process all the newly appended tokens in parallel while still being able to look back at its previous internal state (and thus past reasoning/planning in latent space) from the already-processed tokens, and will then adjust the plan based on the new information.

What happens to you as a human if you come up with a plan with limited information and new information is provided to you?

replies(1): >>45651189 #
13. sailingparrot ◴[] No.45649723{6}[source]
I think a better mental framework for how those models work is that they keep a history of the states of their "memory" across time.

Where humans have a single evolving state of memory, LLMs have access to all the states of their "memory" across time. And while past states can't be changed, the new state can: this is the current token's hidden state, and to form this new state they look both at the history of previous states and at the new information (the last token having been sampled, or external tokens from RAG or whatnot appended to the context).

This is how progress is stored.

replies(1): >>45650077 #
14. HarHarVeryFunny ◴[] No.45650077{7}[source]
Thanks, that's a useful way to think about it.

Presumably the internal state at any given token position must also be encoding information specific to that position, as well as this evolving/current memory... So, can this be seen in the internal embeddings - are they composed of a position-dependent part that changes a lot between positions, and an evolving memory part that is largely similar between positions only changing slowly?

Are there any papers or talks discussing this?

replies(1): >>45650379 #
15. sailingparrot ◴[] No.45650379{8}[source]
I don't remember any paper looking at this specific question (though it might be out there), but in general Anthropic's Transformer Circuits Thread series of articles is very good on the broader subject: https://transformer-circuits.pub
16. LelouBil ◴[] No.45651189{6}[source]
Not the original person you are replying to, but I wanted to add:

Yes, they can plan within a single forward pass like you said, but I still think they "start anew at each token" because they have no state/memory other than their output.

I guess this is differing interpretations of the meaning of "start anew", but personally I would agree that having no internal state and simply looking back at its previous output to form a new token is "starting anew".

But I'm also not well informed about the topic so happy to be corrected.

replies(2): >>45651765 #>>45652292 #
17. nl ◴[] No.45651754{3}[source]
Right, and this is what "reasoning LLMs" work around by having explicitly labelled "reasoning tokens".

This lets them "save" the plan between tokens, so when generating the next token the model is following the plan.

18. nl ◴[] No.45651765{7}[source]
Worth noting here for others following that a single forward pass is what generates a single token.

It's correct to state that the LLM starts anew for each token.

The workaround for this is to pass the existing plan back into it as part of the context.
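
A minimal sketch of that workaround in Python, where `generate` is a purely hypothetical stub standing in for one forward pass returning one sampled token: the only way the earlier "plan" (e.g. reasoning tokens) survives to the next step is that it sits in the context handed back in.

    def generate(context: list[str]) -> str:
        # Stub standing in for a full forward pass of the LLM.
        return f"token{len(context)}"

    # The "plan" exists only as tokens already in the context.
    context = ["<reasoning>", "first", "outline", "the", "answer", "</reasoning>"]
    for _ in range(3):
        next_token = generate(context)   # the model sees nothing but the context...
        context.append(next_token)       # ...so the plan persists only by being re-fed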

replies(1): >>45652333 #
19. sailingparrot ◴[] No.45652292{7}[source]
But you are missing the causal attention from your analysis. The output is not the only thing that is preserved; there is also the KV cache.

At token 1, the model goes through, say, 28 transformer blocks, and for each one of those blocks we save two projections of the hidden state (the key and the value) in a cache.

At token 2, on top of seeing the new token, the model is now also able, in each one of those 28 blocks, to look at the previously saved hidden states from token 1.

At token 3, it can see the states from tokens 2 and 1, etc.

However, I still agree that it is not a perfect information-passing mechanism because of how those models are trained (and something like the feedback transformer would be better), but information still is very much being passed from earlier tokens to later ones.
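
A toy decode loop with such a cache (NumPy again; one block, one head, invented sizes): each step computes q/k/v only for the new token, appends that token's key and value projections to the cache, and attends over everything cached so far, which is how earlier hidden states reach the new token.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    k_cache, v_cache = np.empty((0, d)), np.empty((0, d))

    def decode_step(h_new, k_cache, v_cache):
        q, k, v = h_new @ Wq, h_new @ Wk, h_new @ Wv
        k_cache = np.vstack([k_cache, k])        # save this token's two projections
        v_cache = np.vstack([v_cache, v])
        s = q @ k_cache.T / np.sqrt(d)           # attend over all cached positions
        a = np.exp(s - s.max())
        a /= a.sum()
        return a @ v_cache, k_cache, v_cache

    for _ in range(5):                           # five decode steps
        h_new = rng.normal(size=(1, d))          # stand-in for the new token's hidden state
        out, k_cache, v_cache = decode_step(h_new, k_cache, v_cache)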

replies(1): >>45655087 #
20. sailingparrot ◴[] No.45652333{8}[source]
You are forgetting about attention over the KV cache, which is the mechanism that allows the LLM not to start anew every time.
replies(1): >>45652519 #
21. nl ◴[] No.45652519{9}[source]
I mean sure, but that is a performance optimization that doesn't really change what is going on.

It's still recalculating, just that intermediate steps are cached.

replies(1): >>45654589 #
22. sailingparrot ◴[] No.45654589{10}[source]
Isn't the ability to store past reasoning in an external system, so as to avoid having to redo the computation all over again, precisely what a memory is, though?

And sure, mathematically, KV caching instead of re-doing prefill at every token is just equivalent. But the important part of my message was the attention.

A plan/reasoning formed during the forward pass of token 0 can be looked at by the subsequent passes (or parallel ones, if you don't want to use the cache) of tokens 1, …, n. So you cannot consider token n to be starting from scratch in terms of reasoning/planning, as it can reuse what has already been planned in previous tokens.

If you think about inference with KV caching, even though you are right that mathematically it's just an optimization, it makes this behavior much easier to reason about: the KV cache is a store of past internal states that the model can attend to for subsequent tokens, which allows those subsequent tokens' internal hidden states to be more than just a repetition of what the model already reasoned about in the past.

23. LelouBil ◴[] No.45655087{8}[source]
Like another commenter said, isn't the KV cache a performance optimization to avoid having to redo work that was already done? Or does it fundamentally alter the output of the LLM, and so preserve state that is not present in the output of the LLM?
replies(1): >>45655552 #
24. sailingparrot ◴[] No.45655552{9}[source]
Yes, it's "just" an optimization technique, in the sense that you could drop it and end up with the same result (given the same input sequence), just much more slowly.

Conceptually, what matters is not the KV cache but the attention. But IMHO, thinking about how the model behaves during inference, outputting one token at a time and doing attention over the KV cache, is much easier to grok than thinking about training/prefilling, where the KV cache is absent and everything happens in parallel (although they are mathematically equivalent).

The important part of my point is that when the model is processing token N, it can check its past internal state from tokens 1, …, N-1, and thus "see" its previous plan and reasoning and iterate on it, rather than just repeating everything from scratch in each token's hidden state (with a caveat, explained at the end).

token_1 ──▶ h₁ᴸ
token_2 ──▶ h₂ᴸ ──attn(h₁ᴸ)──────▶ refines the reasoning
token_3 ──▶ h₃ᴸ ──attn(h₁ᴸ, h₂ᴸ)─▶ refines it further

And the KV cache makes this persistent across time, so the entire system (LLM + cache) is effectively able to save its state and iterate on it at each token, rather than having to start from scratch every time.

But ultimately it's a Markov chain, so again, mathematically, yes, you could just redo the full computation every time and end up in the same place.

Caveat: because token N at layer L can attend to all other tokens < N but only at layer L, it only gets to see what the reasoning was at that depth, not what it was after a full pass. So it's not a perfect information-passing mechanism, and is more pyramidal than a straight line; hence why I referenced feedback transformers in another message. But the principle still applies that information is passed across time steps.
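
A tiny sketch of that pyramidal access pattern (Python, invented sizes): listing which cached (token, layer) states each position can attend to makes it clear the model only ever sees the same layer of earlier tokens, never their deeper layers.

    n_tokens, n_layers = 3, 3
    for layer in range(n_layers):
        for tok in range(n_tokens):
            # Causal attention at this layer: earlier tokens (and itself), same layer only.
            visible = [(t, layer) for t in range(tok + 1)]
            print(f"token {tok}, layer {layer} attends to {visible}")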

25. jama211 ◴[] No.45658371{3}[source]
I suppose there’s something in what you’re saying, it’s just that’s it’s sorta vague and hard to parse for me. It also depends on the higher order problem space, for example: is it efficient if the problem is defined by “make something that can adapt to a problem space and solve it without manual engineering” rather than “make something with a long lead up time where you understand the problem space in advance and therefore have time to optimise the engine”. In the former, the neural network would indeed count as solving this efficiently, because of the given definition of the goal.