BERT is just a single text diffusion step

(nathan.rs)

kibwen ◴[20 Oct 25 15:52 UTC] No.45645307[source]▶

>>45644328 (OP) #

To me, the diffusion-based approach "feels" more akin to what's going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words; I start by having some fuzzy idea in my head and the challenge is in serializing it into language coherently.

replies(14): >>45645350 #>>45645383 #>>45645401 #>>45645402 #>>45645509 #>>45645523 #>>45645607 #>>45645665 #>>45645670 #>>45645891 #>>45645973 #>>45647491 #>>45648578 #>>45652892 #

1. crubier ◴[20 Oct 25 15:58 UTC] No.45645401[source]▶

>>45645307 #

You 100% do pronounce or write words one at a time sequentially.

But before starting your sentence, you internally formulate the gist of the sentence you're going to say.

Which is exactly what happens in an LLM's latent space too, before it starts outputting the first token.

replies(5): >>45645466 #>>45645546 #>>45645695 #>>45645968 #>>45646205 #

2. froobius ◴[20 Oct 25 16:04 UTC] No.45645466[source]▶

>>45645401 (TP) #

(Just to expand on that, it's true not just for the first token. There's a lot of computation, including potentially planning ahead, before each token is output.)

That's why saying "it's just predicting the next word" is a misguided take.

3. taeric ◴[20 Oct 25 16:10 UTC] No.45645546[source]▶

>>45645401 (TP) #

I'm curious what makes you so confident on this? I confess I expect that people are often far more cognizant of the last thing that they want to say when they start?

I don't think you do a random walk through the words of a sentence as you conceive it. But it is hard not to think that people center themes and moods in their mind as they compose their thoughts into sentences.

Similarly, have you ever looked into how actors learn their lines? It is often in a way that is a lot closer to diffusion than to token-at-a-time.

replies(7): >>45645580 #>>45645621 #>>45646119 #>>45646153 #>>45646165 #>>45647044 #>>45647828 #

4. CaptainOfCoit ◴[20 Oct 25 16:13 UTC] No.45645580[source]▶

>>45645546 #

I think there is a wide range of ways to "turn something in the head into words", and sometimes you use the "this is the final point, work towards it" approach and sometimes you use the "not sure what will happen, let's just start talking and go wherever" approach. Different approaches have different tradeoffs, and of course different people have different defaults.

I can confess to not always knowing where I'll end up when I start talking. Similarly, it's not that every time I open my mouth I'm just starting out; sometimes I do have a goal and a conclusion.

5. refulgentis ◴[20 Oct 25 16:16 UTC] No.45645621[source]▶

>>45645546 #

It's just too far of an analogy, it starts in the familiar SWE tarpit of human brain = lim(n matmuls) as n => infinity.

Then, glorifies wrestling in said tarpit: how do people actually compose sentences? Is an LLM thinking or writing? Can you look into how actors memorize lines before responding?

Error beyond the tarpit is, these are all ineffable questions that assume a singular answer to an underspecified question across many bags of sentient meat.

Taking a step back to the start, we're wondering:

Do LLMs plan for token N + X, while purely working to output token N?

TL;DR: yes.

via https://www.anthropic.com/research/tracing-thoughts-language....

Clear quick example they have is, ask it to write a poem, get state at end of line 1, scramble the feature that looks ahead to end of line 2's rhyme.

replies(1): >>45646235 #

6. smokel ◴[20 Oct 25 16:22 UTC] No.45645695[source]▶

>>45645401 (TP) #

For most serious texts I start with a tree outline, before I engage my literary skills.

7. pessimizer ◴[20 Oct 25 16:44 UTC] No.45645968[source]▶

>>45645401 (TP) #

Like most people I jump back and forth when I speak, disclaiming, correcting, and appending to previous utterances. I do this even more when I write, eradicating entire sentences and even the ideas they contain, within paragraphs in which, by the time they were finished, the sentence seemed unnecessary or inconsistent.

I did it multiple times while writing this comment, and it is only four sentences. The previous sentence once said "two sentences," and after I added this statement it was changed to "four sentences."

8. btown ◴[20 Oct 25 16:52 UTC] No.45646119[source]▶

>>45645546 #

> far more cognizant of the last thing that they want to say when they start

This can be captured by generating reasoning tokens (outputting some representation of the desired conclusion in token form, then using it as context for the actual tokens), or even by an intermediate layer of a model not using reasoning.

If a certain set of nodes are strong contributors to generate the concluding sentence, and they remain strong throughout all generated tokens, who's to say if those nodes weren't capturing a latent representation of the "crux" of the answer before any tokens were generated?

(This is also in the context of the LLM being able to use long-range attention to not need to encode in full detail what it "wants to say" - just the parts of the original input text that it is focusing on over time.)

Of course, this doesn't mean that this is the optimal way to build coherent and well-reasoned answers, nor have we found an architecture that allows us to reliably understand what is going on! But the mechanics for what you describe certainly can arise in non-diffusion LLM architectures.

9. jrowen ◴[20 Oct 25 16:54 UTC] No.45646153[source]▶

>>45645546 #

They're speaking literally. When talking to someone (or writing), you ultimately say the words in order (edits or corrections notwithstanding). If you look at the gifs of how the text is generated - I don't know of anyone that has ever written like that. Literally writing disconnected individual words of the actual draft ("during," "and," "the") in the middle of a sentence and then coming back and filling in the rest. Even speaking like that would be incredibly difficult.

Which is not to say that it's wrong or a bad approach. And I get why people are feeling a connection to the "diffusive" style. But, at the end of the day, all of these methods do build as their ultimate goal a coherent sequence of words that follow one after the other. It's just a difference of how much insight you have into the process.

replies(1): >>45647690 #

10. Workaccount2 ◴[20 Oct 25 16:55 UTC] No.45646165[source]▶

>>45645546 #

People don't come up with things their brain does.

Words rise from an abyss and are served to you, you have zero insight into their formation. If I tell you to think of an animal, one just appears in your "context", how it got there is unknown.

So really there is no argument to be made, because we still don't mechanistically understand how the brain works.

replies(1): >>45646871 #

11. NoMoreNicksLeft ◴[20 Oct 25 16:58 UTC] No.45646205[source]▶

>>45645401 (TP) #

>You 100% do pronounce or write words one at a time sequentially.

It's statements like these that make me wonder if I am the same species as everyone else. Quite often, I've picked adjectives and idioms first, and then filled in around them to form sentences. Often because there is some pun or wordplay, or just something that has a nice ring to it, and I want to lead my words in that direction. If you're only choosing them one at a time and sequentially, have you ever considered that you might just be a dimwit?

It's not like you don't see this happening all around you in others. Sure you can't read minds, but have you never once watched someone copyedit something they've written, where they move phrases and sentences around, where they switch out words for synonyms, and so on? There are at least dozens of fictional scenes in popular media, you must have seen one. You have to have noticed hints at some point in your life that this occurs. Please. Just tell me that you spoke hastily to score internet argument points, and that you don't believe this thing you've said.

replies(2): >>45647651 #>>45652847 #

12. jsrozner ◴[20 Oct 25 17:00 UTC] No.45646235{3}[source]▶

>>45645621 #

Let's just not call it planning.

In order to model poetry autoregressively, you're going to need a variable that captures rhyme scheme. At the point where you've ended the first line, the model needs to keep track of the rhyme that was used, just like it does for something like coreference resolution.

I don't think that the mentioned paper shows that the model engages in a preplanning phase in which it plans the rhyme that will come. In fact such would be impossible. Model state is present only in so-far-generated text. It is only after the model has found itself in a poetry generating context and has also selected the first line-ending word, that a rhyme scheme "emerges" as a variable. (Now yes, as you increase the posterior probability of 'being in a poem' given context so far, you would expect that you also increase the probability of the rhyme-scheme variable's existing.)

replies(2): >>45646312 #>>45648785 #

13. refulgentis ◴[20 Oct 25 17:06 UTC] No.45646312{4}[source]▶

>>45646235 #

I’m confused: the blog shows they A) predict the end of line 2 using the state at the end of line 1 and B) can choose the end of line 2 by altering state at end of line 1.

Might I trouble you for help getting from there to “such would be impossible”, where such is “the model…plans the rhyme to come”?

Edit: I’m surprised to be at -2 for this. I am representing the contents of the post accurately. It’s unintuitive for sure, but it’s the case.

replies(1): >>45655442 #

14. aeonik ◴[20 Oct 25 17:48 UTC] No.45646871{3}[source]▶

>>45646165 #

We don't know exactly how consciousness works in the human brain, but we know way more than "comes from the abyss".

When I read that text, something like this happens:

Visual perception of text (V1, VWFA) → Linguistic comprehension (Angular & Temporal Language Areas) → Semantic activation (Temporal + Hippocampal Network) → Competitive attractor stabilization (Prefrontal & Cingulate) → Top-down visual reactivation (Occipital & Fusiform) → Conscious imagery (Prefrontal–Parietal–Thalamic Loop).

and you can find experts in each of those areas who understand the specifics a lot more.

replies(1): >>45647195 #

15. bee_rider ◴[20 Oct 25 18:01 UTC] No.45647044[source]▶

>>45645546 #

It must be the case that some smart people have studied how we think, right?

The first person experience of having a thought, to me, feels like I have the whole thought in my head, and then I imagine expressing it to somebody one word at a time. But it really feels like I’m reading out the existing thought.

Then, if I’m thinking hard, I go around a bit and argue against the thought that was expressed in my head (either because it is not a perfect representation of the actual underlying thought, or maybe because it turns out that thought was incorrect once I expressed it sequentially).

At least that’s what I think thinking feels like. But, I am just a guy thinking about my brain. Surely philosophers of the mind or something have queried this stuff with more rigor.

16. giardini ◴[20 Oct 25 18:14 UTC] No.45647195{4}[source]▶

>>45646871 #

aeonik says >"We don't know exactly how consciousness works in the human brain, but we know way more than "comes from the abyss"."<

You are undoubtedly technically correct, but I prefer the simplicity, purity and ease-of-use of the abysmal model, especially in comparison with other similar competing models, such as the below-discussed "tarpit" model.

17. crubier ◴[20 Oct 25 18:50 UTC] No.45647651[source]▶

>>45646205 #

Are you able to pronounce multiple words in superposition at the same time? Are you able to write multiple words in superposition? Can you read the following sentence: "HWeolrllod!"

Clearly communication is sequential.

LLMs are not more sequential than your vocal cords or your handwriting. They also plan ahead before writing.

18. tekne ◴[20 Oct 25 18:52 UTC] No.45647690{3}[source]▶

>>45646153 #

Weird anecdote, but one of the reasons I have always struggled with writing is precisely that my process seems highly nonlinear. I start with a disjoint mind map of ideas I want to get out, often just single words, and need to somehow cohere that into text, which often happens out-of-order. The original notes are often completely unordered diffusion-like scrawling, the difference being I have less idea what the final positions of the words were going to be when I wrote them.

replies(1): >>45648043 #

19. ◴[20 Oct 25 19:03 UTC] No.45647828[source]▶

>>45645546 #

20. crubier ◴[20 Oct 25 19:22 UTC] No.45648043{4}[source]▶

>>45647690 #

I can believe that your abstract thoughts in latent space are diffusing/forming progressively when you are thinking.

But I can't believe the actual literal words are diffusing when you're thinking.

When being asked: "How are you today", there is no way that your thoughts are literally like "Alpha zulu banana" => "I banana coco" => "I banana good" => "I am good". The diffusion does not happen at the output token layer, it happens much earlier at a higher level of abstraction.

replies(1): >>45648242 #

21. jrowen ◴[20 Oct 25 19:37 UTC] No.45648242{5}[source]▶

>>45648043 #

Or like this:

"I ____ ______ ______ ______ and _____ _____ ______ ____ the ____ _____ _____ _____."

If the images in the article are to be considered an accurate representation, the model is putting meaningless bits of connective tissue way before the actual ideas. Maybe it's not working like that. But the "token-at-a-time" model is also obviously not literally looking at only one word at a time either.
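
For what it's worth, the two reveal orders being contrasted here can be sketched with a toy example. This is only an illustration under made-up assumptions: the sentence, the `confidence` scores, and the function names are all hypothetical, and no real model is involved. It just shows that "left to right" and "most confident positions first" reach the same final text through very different intermediate states.

```python
MASK = "____"

def autoregressive_steps(tokens):
    """Reveal tokens strictly left to right, one per step."""
    n = len(tokens)
    return [tokens[:i] + [MASK] * (n - i) for i in range(1, n + 1)]

def unmasking_steps(tokens, confidence, per_step):
    """Reveal the highest-confidence masked positions first, per_step at a time.

    `confidence` stands in for whatever score a masked model would assign
    to each position; here it is simply supplied by hand.
    """
    order = sorted(range(len(tokens)), key=lambda i: -confidence[i])
    state = [MASK] * len(tokens)
    steps = []
    for start in range(0, len(order), per_step):
        for i in order[start:start + per_step]:
            state[i] = tokens[i]
        steps.append(list(state))  # snapshot after this step
    return steps

# Hypothetical sentence and hand-picked scores: short connective words
# get high confidence, content words get low confidence.
tokens = ["I", "am", "good", "and", "the", "weather", "is", "nice"]
confidence = [0.9, 0.4, 0.2, 0.8, 0.7, 0.1, 0.5, 0.3]

for step in unmasking_steps(tokens, confidence, per_step=2):
    print(" ".join(step))
```

With these hand-picked scores the connective words ("I", "and", "the", "is") appear before the content words, which is the pattern described above; with different scores the order would of course differ.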

22. froobius ◴[20 Oct 25 20:19 UTC] No.45648785{4}[source]▶

>>45646235 #

> Model state is present only in so-far-generated text

Wrong. There's "model state" (I assume you mean hidden layers) not just in the generated text, but also in the initial prompt given to the model. I.e. the model can start its planning from the moment it's given the instruction, without even having predicted a token yet. That's actually what they show in the paper above...

> It is only after the model has found itself in a poetry generating context and has also selected the first line-ending word, that a rhyme scheme "emerges" as a variable

This is an assertion based on flawed reasoning.

(Also, these ideas should really be backed up by evidence and experimentation before asserting them so definitively.)

23. stevenhuang ◴[21 Oct 25 05:46 UTC] No.45652847[source]▶

>>45646205 #

All of that can still be seen as a linear sequence of actions from the perspective of human I/O with the environment.

What happens in the black box of the human mind to determine the next word to write/say is made irrelevant at this level of abstraction, as regardless of how, it would still result in a linear sequence of actions as observed by the environment.

24. froobius ◴[21 Oct 25 13:21 UTC] No.45655442{5}[source]▶

>>45646312 #

I agree, the post above you is patently wrong / hasn't read the paper they are dismissing. I also got multiple downvotes for disagreeing, with no actual rebuttal.

replies(1): >>45655646 #

25. refulgentis ◴[21 Oct 25 13:36 UTC] No.45655646{6}[source]▶

>>45655442 #

You're my fav new-ish account, spent about 5 minutes Googling froobius yesterday tryna find more content. :) Concise, clear, no BS takes for high-minded nonsense that sounds technical. HN's such a hellhole for LLM stuff, the people who are hacking ain't here, and the people who are, well, they mostly like yapping about how it connects to some unrelated grand idea they misremember from undergrad. Cheers.

(n.b. been here 16 years and this is such a classic downvote scenario the past two years. people overindexing on big words that are familiar to them, and on any sort of challenging tone. That's almost certainly why I got mine, I was the dummy who read the article and couldn't grasp the stats nonsense, and "could I bother you to help" or w/e BS I said, well, was BS)
