BERT is just a single text diffusion step

(nathan.rs)

454 points nathan-barry | 4 comments | 20 Oct 25 14:31 UTC | HN request time: 0s | source

Show context

kibwen ◴[20 Oct 25 15:52 UTC] No.45645307[source]▶

To me, the diffusion-based approach "feels" more akin to whats going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one a time based on previously spoken words; I start by having some fuzzy idea in my head and the challenge is in serializing it into language coherently.

replies(14): >>45645350 #>>45645383 #>>45645401 #>>45645402 #>>45645509 #>>45645523 #>>45645607 #>>45645665 #>>45645670 #>>45645891 #>>45645973 #>>45647491 #>>45648578 #>>45652892 #

crubier ◴[20 Oct 25 15:58 UTC] No.45645401[source]▶

>>45645307 #

You 100% do pronounce or write words one at a time sequentially.

But before starting your sentence, you internally formulate the gist of the sentence you're going to say.

Which is exactly what happens in LLMs latent space too before they start outputting the first token.

replies(5): >>45645466 #>>45645546 #>>45645695 #>>45645968 #>>45646205 #

taeric ◴[20 Oct 25 16:10 UTC] No.45645546[source]▶

>>45645401 #

I'm curious what makes you so confident on this? I confess I expect that people are often far more cognizant of the last thing that the they want to say when they start?

I don't think you do a random walk through the words of a sentence as you conceive it. But it is hard not to think people don't center themes and moods in their mind as they compose their thoughts into sentences.

Similarly, have you ever looked into how actors learn their lines? It is often in a way that is a lot closer to a diffusion than token at a time.

replies(7): >>45645580 #>>45645621 #>>45646119 #>>45646153 #>>45646165 #>>45647044 #>>45647828 #

1. jrowen ◴[20 Oct 25 16:54 UTC] No.45646153{3}[source]▶

>>45645546 #

They're speaking literally. When talking to someone (or writing), you ultimately say the words in order (edits or corrections notwithstanding). If you look at the gifs of how the text is generated - I don't know of anyone that has ever written like that. Literally writing disconnected individual words of the actual draft ("during," "and," "the") in the middle of a sentence and then coming back and filling in the rest. Even speaking like that would be incredibly difficult.

Which is not to say that it's wrong or a bad approach. And I get why people are feeling a connection to the "diffusive" style. But, at the end of the day, all of these methods do build as their ultimate goal a coherent sequence of words that follow one after the other. It's just a difference of how much insight you have into the process.

replies(1): >>45647690 #

2. tekne ◴[20 Oct 25 18:52 UTC] No.45647690[source]▶

>>45646153 (TP) #

Weird anecdote, but one of the reasons I have always struggled with writing is precisely that my process seems highly nonlinear. I start with a disjoint mind map of ideas I want to get out, often just single words, and need to somehow cohere that into text, which often happens out-of-order. The original notes are often completely unordered diffusion-like scrawling, the difference being I have less idea what final the positions of the words were going to be when I wrote them.

replies(1): >>45648043 #

3. crubier ◴[20 Oct 25 19:22 UTC] No.45648043[source]▶

>>45647690 #

I can believe that your abstract thoughts in latent space are diffusing/forming progressively when you are thinking.

But I can't believe the actual literal words are diffusing when you're thinking.

When being asked: "How are you today", there is no way that your thoughts are literally like "Alpha zulu banana" => "I banana coco" => "I banana good" => "I am good". The diffusion does not happen at the output token layer, it happens much earlier at a higher level of abstraction.

replies(1): >>45648242 #

4. jrowen ◴[20 Oct 25 19:37 UTC] No.45648242{3}[source]▶

>>45648043 #

Or like this:

"I ____ ______ ______ ______ and _____ _____ ______ ____ the ____ _____ _____ _____."

If the images in the article are to be considered an accurate representation, the model is putting meaningless bits of connective tissue way before the actual ideas. Maybe it's not working like that. But the "token-at-a-time" model is also obviously not literally looking at only one word at a time either.

↑