BERT is just a single text diffusion step

(nathan.rs)

454 points nathan-barry | 1 comments | 20 Oct 25 14:31 UTC | HN request time: 0s | source

Show context

kibwen ◴[20 Oct 25 15:52 UTC] No.45645307[source]▶

To me, the diffusion-based approach "feels" more akin to whats going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one a time based on previously spoken words; I start by having some fuzzy idea in my head and the challenge is in serializing it into language coherently.

replies(14): >>45645350 #>>45645383 #>>45645401 #>>45645402 #>>45645509 #>>45645523 #>>45645607 #>>45645665 #>>45645670 #>>45645891 #>>45645973 #>>45647491 #>>45648578 #>>45652892 #

sailingparrot ◴[20 Oct 25 16:44 UTC] No.45645973[source]▶

>>45645307 #

> the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one a time based on previously spoken words

Autoregressive LLMs don't do that either actually. Sure with one forward pass you only get one token at a time, but looking at what is happening in the latent space there are clear signs of long term planning and reasoning that go beyond just the next token.

So I don't think it's necessarily more or less similar to us than diffusion, we do say one word at a time sequentially, even if we have the bigger picture in mind.

replies(5): >>45646422 #>>45650316 #>>45654585 #>>45656793 #>>45663541 #

wizzwizz4 ◴[20 Oct 25 17:14 UTC] No.45646422[source]▶

>>45645973 #

If a process is necessary for performing a task, (sufficiently-large) neural networks trained on that task will approximate that process. That doesn't mean they're doing it anything resembling efficiently, or that a different architecture / algorithm wouldn't produce a better result.

replies(2): >>45646920 #>>45647495 #

sailingparrot ◴[20 Oct 25 18:38 UTC] No.45647495[source]▶

>>45646422 #

I’m not arguing about efficiency though ? Simply saying next token predictors cannot be thought of as actually just thinking about the next token with no long term plan.

replies(1): >>45648362 #

wizzwizz4 ◴[20 Oct 25 19:47 UTC] No.45648362[source]▶

>>45647495 #

They rebuild the "long term plan" anew for every token: there's no guarantee that the reconstructed plan will remain similar between tokens. That's not how planning normally works. (You can find something like this every time there's this kind of gross inefficiency, which is why I gave the general principle.)

replies(3): >>45648898 #>>45648950 #>>45651754 #

HarHarVeryFunny ◴[20 Oct 25 20:34 UTC] No.45648950[source]▶

>>45648362 #

Actually, due to using causal (masked) attention, new tokens appended to the input don't have any effect on what's calculated internally (the "plan") at earlier positions in the input, and a modern LLM therefore uses a KV cache rather than recalculating at those earlier positions.

In other words, the "recalculated" plan will be exactly the same as before, just extended with new planning at the position of each newly appended token.

replies(1): >>45649010 #

astrange ◴[20 Oct 25 20:38 UTC] No.45649010[source]▶

>>45648950 #

You can violate the plan in the sampler by making an "unreasonable" choice of next token to sample (eg by raising the temperature.) So if it does stick to the same plan, it's not going to be a very good one.

replies(1): >>45649359 #

HarHarVeryFunny ◴[20 Oct 25 21:03 UTC] No.45649359[source]▶

>>45649010 #

Yeah.

Karpathy recently referred to LLMs having more "working memory" than a human, apparently referring to these unchanging internal activations as "memory", but it's an odd sort of "working memory" if you can't actually update it to reflect progress on what you are working on, or update per new information (new unexpected token having been sampled).

replies(1): >>45649723 #

sailingparrot ◴[20 Oct 25 21:37 UTC] No.45649723{3}[source]▶

>>45649359 #

I think a better mental framework of how those model work is that they keep an history of the state of their "memory" across time.

Where humans have a single evolving state of our memory LLMs have access to all the states of their "memories" across time, and while past state can't be changed, the new state can: This is the current token's hidden state, and to form this new state they look both at the history of previous states as well as the new information (last token having been sample, or external token from RAG or whatnot appended to the context).

This is how progress is stored.

replies(1): >>45650077 #

HarHarVeryFunny ◴[20 Oct 25 22:11 UTC] No.45650077{4}[source]▶

>>45649723 #

Thanks, that's a useful way to think about it.

Presumably the internal state at any given token position must also be encoding information specific to that position, as well as this evolving/current memory... So, can this be seen in the internal embeddings - are they composed of a position-dependent part that changes a lot between positions, and an evolving memory part that is largely similar between positions only changing slowly?

Are there any papers or talks discussing this ?

replies(1): >>45650379 #

1. sailingparrot ◴[20 Oct 25 22:44 UTC] No.45650379{5}[source]▶

>>45650077 #

I don't remember any paper looking at this specific question (thought it might be out there), but in general Anthropic's circuit threads series of article is very good on the broader subject: https://transformer-circuits.pub

↑