BERT is just a single text diffusion step

(nathan.rs)

454 points nathan-barry | 1 comments | 20 Oct 25 14:31 UTC | HN request time: 0s | source

Show context

kibwen ◴[20 Oct 25 15:52 UTC] No.45645307[source]▶

To me, the diffusion-based approach "feels" more akin to whats going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one a time based on previously spoken words; I start by having some fuzzy idea in my head and the challenge is in serializing it into language coherently.

replies(14): >>45645350 #>>45645383 #>>45645401 #>>45645402 #>>45645509 #>>45645523 #>>45645607 #>>45645665 #>>45645670 #>>45645891 #>>45645973 #>>45647491 #>>45648578 #>>45652892 #

sailingparrot ◴[20 Oct 25 16:44 UTC] No.45645973[source]▶

>>45645307 #

> the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one a time based on previously spoken words

Autoregressive LLMs don't do that either actually. Sure with one forward pass you only get one token at a time, but looking at what is happening in the latent space there are clear signs of long term planning and reasoning that go beyond just the next token.

So I don't think it's necessarily more or less similar to us than diffusion, we do say one word at a time sequentially, even if we have the bigger picture in mind.

replies(5): >>45646422 #>>45650316 #>>45654585 #>>45656793 #>>45663541 #

1. janalsncm ◴[22 Oct 25 00:23 UTC] No.45663541[source]▶

>>45645973 #

To take a simple example, let’s say we ask an autoregressive model a yes/no factual question like “is 1+1=2?”. Then, we force the LLM to start with the wrong answer “No, “ and continue decoding.

An autoregressive model can’t edit the past. If it happens to sample the wrong first token (or we force it to in this case), there’s no going back. Of course there can be many more complicated lines of thinking as well where backtracking would be nice.

“Reasoning” LLMs tack this on with reasoning tokens. But the issue with this is that the LLM has to attend to every incorrect, irrelevant line of thinking which is at a minimum a waste and likely confusing.

As an analogy, in HN I don’t need to attend to every comment under a post in order to generate my next word. I probably just care about the current thread from my comment up to the OP. Of course a model could learn that relationship but that’s a huge waste of compute.

Text diffusion solves the whole problem entirely by allowing the model to simply revise the “no” to a “yes”. Very simple.

↑