BERT is just a single text diffusion step

(nathan.rs)

454 points nathan-barry | 1 comments | 20 Oct 25 14:31 UTC

kibwen [20 Oct 25 15:52 UTC] No.45645307

To me, the diffusion-based approach "feels" more akin to what's going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words; I start by having some fuzzy idea in my head and the challenge is in serializing it into language coherently.
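
To make the contrast concrete, here is a toy sketch of the two decoding orders. It is not a real model: `TARGET` and `CONFIDENCE` are made-up stand-ins for what a trained network would predict, and the function names are hypothetical. It only illustrates which positions get committed at each step.

```python
MASK = "_"

# Hypothetical stand-in for a trained model: a fixed sentence plus a made-up
# confidence score per position (content words high, function words low).
TARGET = ["the", "cat", "sat", "on", "the", "mat"]
CONFIDENCE = [0.2, 0.9, 0.8, 0.3, 0.1, 0.7]


def left_to_right(target):
    """Token-at-a-time: commit position 0, then 1, then 2, and so on."""
    draft = [MASK] * len(target)
    history = []
    for i, token in enumerate(target):
        draft[i] = token
        history.append(" ".join(draft))
    return history


def iterative_unmasking(target, confidence, per_step=2):
    """Diffusion-style: start fully masked, commit the most confident positions first."""
    draft = [MASK] * len(target)
    history = []
    masked = set(range(len(target)))
    while masked:
        # Pick the still-masked positions the "model" is most sure about.
        best = sorted(masked, key=lambda i: confidence[i], reverse=True)[:per_step]
        for i in best:
            draft[i] = target[i]
            masked.discard(i)
        history.append(" ".join(draft))
    return history


if __name__ == "__main__":
    print("\n".join(left_to_right(TARGET)))
    print("\n".join(iterative_unmasking(TARGET, CONFIDENCE)))
```

With these made-up scores, the unmasking version fills in "cat sat" first, then "on" and "mat", and leaves both instances of "the" for last, which is roughly the "fuzzy idea first, serialize later" order described above.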

crubier [20 Oct 25 15:58 UTC] No.45645401

>>45645307 #

You 100% do pronounce or write words one at a time sequentially.

But before starting your sentence, you internally formulate the gist of the sentence you're going to say.

Which is exactly what happens in an LLM's latent space too, before it starts outputting the first token.

taeric [20 Oct 25 16:10 UTC] No.45645546

>>45645401 #

I'm curious what makes you so confident about this? I confess I expect that people are often far more cognizant of the last thing that they want to say when they start.

I don't think you do a random walk through the words of a sentence as you conceive it. But it is hard not to think that people center themes and moods in their mind as they compose their thoughts into sentences.

Similarly, have you ever looked into how actors learn their lines? It is often in a way that is a lot closer to diffusion than to token-at-a-time.

1. btown [20 Oct 25 16:52 UTC] No.45646119

>>45645546 #

> far more cognizant of the last thing that they want to say when they start

This can be captured by generating reasoning tokens (outputting some representation of the desired conclusion in token form, then using it as context for the actual tokens), or even by an intermediate layer of a model that does not use reasoning.
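
A minimal sketch of the reasoning-token version, assuming a hypothetical `next_token` callable and made-up `<think>` / `</think>` / `<eos>` markers. `toy_model` is a hand-written rule table standing in for a trained network; it is only there so the loop runs.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]  # hypothetical: context tokens -> next token


def generate(next_token: NextToken, context: List[str], stop: str, limit: int = 32) -> List[str]:
    """Plain autoregressive loop: append one token at a time until `stop`."""
    out: List[str] = []
    while len(out) < limit:
        tok = next_token(context + out)
        if tok == stop:
            break
        out.append(tok)
    return out


def plan_then_answer(next_token: NextToken, prompt: List[str]) -> Tuple[List[str], List[str]]:
    """First emit a plan in token form, then condition the answer on it."""
    plan = generate(next_token, prompt + ["<think>"], stop="</think>")
    answer = generate(next_token, prompt + ["<think>"] + plan + ["</think>"], stop="<eos>")
    return plan, answer


def toy_model(context: List[str]) -> str:
    """Made-up rule table standing in for a trained network."""
    if "</think>" not in context:
        # Still planning: state the conclusion first, then close the plan.
        return "conclusion=yes" if context[-1] == "<think>" else "</think>"
    # Answering: the final word is read back from the plan already in context.
    answer_so_far = context[context.index("</think>") + 1:]
    script = ["well", "I", "think", "yes" if "conclusion=yes" in context else "no"]
    return script[len(answer_so_far)] if len(answer_so_far) < len(script) else "<eos>"


if __name__ == "__main__":
    print(plan_then_answer(toy_model, ["is", "it", "raining", "?"]))
```

The answer is still produced strictly left to right, but its last word was fixed by the plan before the first answer token was emitted.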

If a certain set of nodes are strong contributors to generating the concluding sentence, and they remain strong throughout all generated tokens, who's to say those nodes weren't capturing a latent representation of the "crux" of the answer before any tokens were generated?

(This is also in the context of the LLM being able to use long-range attention to not need to encode in full detail what it "wants to say" - just the parts of the original input text that it is focusing on over time.)

Of course, this doesn't mean that this is the optimal way to build coherent and well-reasoned answers, nor have we found an architecture that allows us to reliably understand what is going on! But the mechanics for what you describe certainly can arise in non-diffusion LLM architectures.
