454 points nathan-barry | 2 comments

kibwen No.45645307
To me, the diffusion-based approach "feels" more akin to what's going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words; I start by having some fuzzy idea in my head, and the challenge is in serializing it into language coherently.
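
Loosely, the two loops just have different shapes. A toy sketch in Python, with random.choice standing in for a real model's sampling (this is not an actual diffusion LM, only the control flow):

    import random

    VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

    def autoregressive(length):
        # One position at a time, left to right; each pick would condition
        # on everything emitted so far.
        out = []
        for _ in range(length):
            out.append(random.choice(VOCAB))  # stand-in for sampling P(next | out)
        return out

    def diffusion_style(length, steps=4):
        # Start from an all-masked "fuzzy idea" and refine every position
        # in parallel over a few passes.
        out = ["<mask>"] * length
        for _ in range(steps):
            for i in range(length):
                if out[i] == "<mask>" and random.random() < 0.5:
                    out[i] = random.choice(VOCAB)  # stand-in for denoising position i
        return out

    print(autoregressive(6))
    print(diffusion_style(6))
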
replies(14): >>45645350 #>>45645383 #>>45645401 #>>45645402 #>>45645509 #>>45645523 #>>45645607 #>>45645665 #>>45645670 #>>45645891 #>>45645973 #>>45647491 #>>45648578 #>>45652892 #
crubier No.45645401
You 100% do pronounce or write words one at a time sequentially.

But before starting your sentence, you internally formulate the gist of the sentence you're going to say.

Which is exactly what happens in an LLM's latent space too, before it starts outputting the first token.
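
You can even peek at that state directly. A minimal logit-lens-style sketch, assuming GPT-2 via Hugging Face transformers (projecting a hidden state through lm_head is an informal probe, not a claim about the model's internals):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tok("Roses are red, violets are", return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)

    # The final hidden state at the last prompt position exists before any
    # token has been emitted; projecting it through the unembedding shows
    # what the model is already leaning toward.
    h = out.hidden_states[-1][0, -1]
    top = torch.topk(model.lm_head(h), 5).indices.tolist()
    print(tok.convert_ids_to_tokens(top))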

replies(5): >>45645466 #>>45645546 #>>45645695 #>>45645968 #>>45646205 #
taeric No.45645546
I'm curious what makes you so confident about this? I confess I expect that people are often far more cognizant of the last thing they want to say when they start.

I don't think you do a random walk through the words of a sentence as you conceive it. But it is hard not to think that people center themes and moods in their minds as they compose their thoughts into sentences.

Similarly, have you ever looked into how actors learn their lines? It is often in a way that is a lot closer to diffusion than to token-at-a-time generation.

replies(7): >>45645580 #>>45645621 #>>45646119 #>>45646153 #>>45646165 #>>45647044 #>>45647828 #
refulgentis No.45645621
It's just too far of an analogy; it starts in the familiar SWE tarpit of human brain = lim(n matmuls) as n -> infinity.

Then it glorifies wrestling in said tarpit: how do people actually compose sentences? Is an LLM thinking or writing? Can you look into how actors memorize lines before responding?

The error beyond the tarpit is that these are all ineffable questions, each assuming a singular answer to an underspecified question across many bags of sentient meat.

Taking a step back to the start, we're wondering:

Do LLMs plan for token N + X, while purely working to output token N?

TL;DR: yes.

via https://www.anthropic.com/research/tracing-thoughts-language....

A clear, quick example they give: ask the model to write a poem, capture its state at the end of line 1, and scramble the feature that looks ahead to the end of line 2's rhyme.
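
In spirit, the intervention looks like the sketch below, assuming GPT-2 via Hugging Face transformers. Caveat loudly: rhyme_direction here is a hypothetical stand-in (a random vector), since finding the real feature takes their interpretability tooling; only the mechanics of the ablation are real. The prompt is the carrot/rabbit example from the post.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # HYPOTHETICAL: a random unit vector standing in for the "planned rhyme"
    # feature; it demonstrates the shape of the intervention, nothing more.
    rhyme_direction = torch.randn(model.config.n_embd)
    rhyme_direction /= rhyme_direction.norm()

    def ablate(module, inputs, output):
        h = output[0]
        # Project the direction out of the residual stream at the position
        # currently being processed.
        h[:, -1] -= (h[:, -1] @ rhyme_direction).unsqueeze(-1) * rhyme_direction
        return (h,) + output[1:]

    prompt = "A rhyming couplet:\nHe saw a carrot and had to grab it,"
    ids = tok(prompt, return_tensors="pt")

    hook = model.transformer.h[8].register_forward_hook(ablate)  # mid-stack layer, chosen arbitrarily
    print(tok.decode(model.generate(**ids, max_new_tokens=12, do_sample=False)[0]))
    hook.remove()
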

replies(1): >>45646235 #
jsrozner No.45646235
Let's just not call it planning.

In order to model poetry autoregressively, you're going to need a variable that captures rhyme scheme. At the point where you've ended the first line, the model needs to keep track of the rhyme that was used, just like it does for something like coreference resolution.

I don't think the mentioned paper shows that the model engages in a preplanning phase in which it plans the rhyme to come. In fact, such would be impossible: model state is present only in the so-far-generated text. It is only after the model has found itself in a poetry-generating context and has also selected the first line-ending word that a rhyme scheme "emerges" as a variable. (Now yes, as you increase the posterior probability of 'being in a poem' given the context so far, you would expect the probability of the rhyme-scheme variable's existing to increase as well.)
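
(That last parenthetical is just marginalization. Toy numbers, all made up:)

    # P(rhyme variable) marginalized over "is this a poem?"
    p_rhyme_given_poem = 0.9
    p_rhyme_given_prose = 0.05

    for p_poem in (0.1, 0.5, 0.9):  # posterior of "in a poem" as context accumulates
        p_rhyme = p_rhyme_given_poem * p_poem + p_rhyme_given_prose * (1 - p_poem)
        print(f"P(poem)={p_poem:.1f} -> P(rhyme variable)={p_rhyme:.2f}")
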

replies(2): >>45646312 #>>45648785 #
refulgentis No.45646312
I'm confused: the blog shows that they A) predict the end of line 2 from the state at the end of line 1, and B) can choose the end of line 2 by altering the state at the end of line 1.

Might I trouble you for help getting from there to "such would be impossible", where "such" is "the model…plans the rhyme to come"?

Edit: I'm surprised to be at -2 for this. I am representing the contents of the post accurately. It's unintuitive for sure, but it's the case.

replies(1): >>45655442 #
froobius No.45655442
I agree; the post above you is patently wrong, and its author clearly hasn't read the paper they're dismissing. I also got multiple downvotes for disagreeing, with no actual rebuttal.
replies(1): >>45655646 #
refulgentis No.45655646
You're my fav new-ish account; I spent about 5 minutes Googling froobius yesterday tryna find more content. :) Concise, clear, no-BS takes instead of high-minded nonsense that sounds technical. HN's such a hellhole for LLM stuff: the people who are hacking ain't here, and the people who are here mostly like yapping about how it connects to some unrelated grand idea they misremember from undergrad. Cheers.

(n.b. I've been here 16 years, and this is such a classic downvote scenario of the past two years: people overindexing on big words that are familiar to them, and on any sort of challenging tone. That's almost certainly why I got mine; I was the dummy who read the article and couldn't grasp the stats nonsense, and "could I bother you to help" or w/e BS I said, well, was BS.)