454 points nathan-barry | 7 comments
kibwen No.45645307
To me, the diffusion-based approach "feels" more akin to what's going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words; I start by having some fuzzy idea in my head, and the challenge is in serializing it into language coherently.
replies(14): >>45645350 #>>45645383 #>>45645401 #>>45645402 #>>45645509 #>>45645523 #>>45645607 #>>45645665 #>>45645670 #>>45645891 #>>45645973 #>>45647491 #>>45648578 #>>45652892 #
sailingparrot No.45645973
> the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words

Autoregressive LLMs don't do that either, actually. Sure, with one forward pass you only get one token at a time, but looking at what is happening in the latent space, there are clear signs of long-term planning and reasoning that go beyond just the next token.

So I don't think it's necessarily more or less similar to us than diffusion: we do say one word at a time, sequentially, even if we have the bigger picture in mind.
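
To make the mechanics concrete, here's a minimal greedy decoding loop, sketched with Hugging Face transformers and GPT-2 (model and prompt chosen purely for illustration). One token comes out per forward pass, but the hidden states computed along the way are where any longer-horizon structure lives:

    # Minimal greedy autoregressive loop: one token per forward pass.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(10):
            logits = model(ids).logits        # forward pass over the whole sequence
            next_id = logits[0, -1].argmax()  # emit only one token: the last position's
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    print(tok.decode(ids[0]))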

replies(5): >>45646422 #>>45650316 #>>45654585 #>>45656793 #>>45663541 #
1. bjourne No.45650316
That is precisely what autoregressive means. Perhaps you meant to write that modern LLMs are not strictly autoregressive?
replies(1): >>45650411 #
2. janalsncm No.45650411
I think they are distinguishing the mechanical process of generation from the way the idea exists. It's the same as how a person can literally only speak one word at a time, but the ideas might be nonlinear.
replies(2): >>45650595 #>>45659970 #
3. sailingparrot No.45650595
Indeed, that's what I meant. The LLM isn't a blank slate at the beginning of each new token during autoregression, as the KV cache is there.
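
As a sketch of what the cache buys you (same illustrative setup, transformers + GPT-2): the prompt is processed once, and every later step feeds in only the newest token plus the accumulated keys/values, so the model resumes from its built-up state rather than from a blank slate:

    # Decoding with the KV cache: each step reuses the state built so far.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)  # one pass over the full prompt
        next_id = out.logits[0, -1].argmax().view(1, 1)
        new_ids = [next_id]
        for _ in range(10):
            # feed only the newest token; the cache supplies all earlier state
            out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
            next_id = out.logits[0, -1].argmax().view(1, 1)
            new_ids.append(next_id)
    print(tok.decode(torch.cat(new_ids, dim=1)[0]))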
4. bjourne No.45659970
If so, they are wrong. :) Autoregressive just means that the probability of the next token is a function of the already seen/emitted tokens. Any "ideas that may exist" are entirely embedded in this sequence.
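
In symbols, that's the standard chain-rule factorization:

    p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})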
replies(1): >>45660565 #
5. sailingparrot No.45660565
> entirely embedded in this sequence.

Obviously wrong, as otherwise every model would predict exactly the same thing; it would not even be predicting anymore, simply decoding.

The sequence is not enough to reproduce the exact output; you also need the weights.

And the way the model works is by attending to its own internal state (weights × input) and refining it, both across the depth (layer) dimension and across the time (token) dimension.

The fact that you can get the model to give you the exact same output by fixing a few seeds is only a consequence of the process being Markovian, and is orthogonal to the fact that at each token position the model is "thinking" about a longer horizon than the present token and is able to reuse that representation at later time steps.
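
A quick way to see the weights point (a toy check with transformers, comparing gpt2 and distilgpt2, which share a tokenizer; the models are arbitrary): feed the identical sequence to two different sets of weights and you get different next-token distributions:

    # Same token sequence, different weights, different predictions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # distilgpt2 shares this vocab
    ids = tok("The meaning of life is", return_tensors="pt").input_ids

    for name in ("gpt2", "distilgpt2"):
        model = AutoModelForCausalLM.from_pretrained(name).eval()
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        top3 = [tok.decode(i) for i in logits.topk(3).indices]
        print(name, top3)  # the sequence alone doesn't fix the output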

replies(1): >>45662988 #
6. bjourne No.45662988
Well, that an autoregressive model has parameters does not mean it is not autoregressive. LLMs are not Markovian.
replies(1): >>45664399 #
7. sailingparrot No.45664399
At no point have I argued that LLMs aren't autoregressive; I am merely talking about LLMs' ability to reason across time steps, so it seems we are talking past each other, which won't lead anywhere.

And yes, LLMs can be studied under the lens of Markov processes: https://arxiv.org/pdf/2410.02724

Have a good day