BERT is just a single text diffusion step

(nathan.rs)

454 points nathan-barry | 1 comments | 20 Oct 25 14:31 UTC

kibwen ◴[20 Oct 25 15:52 UTC] No.45645307[source]▶

To me, the diffusion-based approach "feels" more akin to what's going on in an animal brain than the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words; I start with some fuzzy idea in my head, and the challenge is serializing it into language coherently.

replies(14): >>45645350 #>>45645383 #>>45645401 #>>45645402 #>>45645509 #>>45645523 #>>45645607 #>>45645665 #>>45645670 #>>45645891 #>>45645973 #>>45647491 #>>45648578 #>>45652892 #

sailingparrot ◴[20 Oct 25 16:44 UTC] No.45645973[source]▶

>>45645307 #

> the token-at-a-time approach of the in-vogue LLMs. Speaking for myself, I don't generate words one at a time based on previously spoken words

Autoregressive LLMs don't do that either, actually. Sure, with one forward pass you only get one token at a time, but looking at what is happening in the latent space, there are clear signs of long-term planning and reasoning that go beyond just the next token.

So I don't think it's necessarily more or less similar to us than diffusion: we do say one word at a time, sequentially, even if we have the bigger picture in mind.

replies(5): >>45646422 #>>45650316 #>>45654585 #>>45656793 #>>45663541 #

wizzwizz4 ◴[20 Oct 25 17:14 UTC] No.45646422[source]▶

>>45645973 #

If a process is necessary for performing a task, (sufficiently large) neural networks trained on that task will approximate that process. That doesn't mean they're doing it in anything resembling an efficient way, or that a different architecture / algorithm wouldn't produce a better result.

replies(2): >>45646920 #>>45647495 #

jama211 ◴[20 Oct 25 17:52 UTC] No.45646920[source]▶

>>45646422 #

It also doesn’t mean they’re doing it inefficiently.

replies(1): >>45647093 #

pinkmuffinere ◴[20 Oct 25 18:05 UTC] No.45647093[source]▶

>>45646920 #

I read this to mean “just because the process doesn’t match the problem, that doesn’t mean it’s inefficient”. But I think it does mean that. I expect we intuitively know that data structures which match the structure of a problem are more efficient than those that don’t. I think the same thing applies here.

I realize my argument is hand-wavy, I haven’t defined “efficient” (in space? Time? Energy?), and there are other shortcomings, but I feel this is “good enough” to be convincing.

replies(2): >>45647687 #>>45658371 #

wizzwizz4 ◴[20 Oct 25 18:52 UTC] No.45647687[source]▶

>>45647093 #

Example: a list of (key, value) pairs is a perfectly valid way to implement a map, and it suffices. However, a more complicated tree structure, perhaps with hashed keys, is usually far more efficient, and the difference becomes increasingly noticeable as the number of pairs stored in the map grows large.
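
To make that concrete, here is a minimal Python sketch. The names `PairListMap` and `worst_case_comparisons` are made up for illustration, and the comparison counter exists only to show the cost of a lookup. Python's built-in `dict` (a hash table, not a tree) stands in for the "more complicated" structure, so this illustrates the general point rather than any particular tree design.

```python
from typing import Any, Hashable, List, Tuple


class PairListMap:
    """A map stored as an unsorted list of (key, value) pairs.

    Correct, but every lookup scans the list, so it costs O(n).
    """

    def __init__(self) -> None:
        self._pairs: List[Tuple[Hashable, Any]] = []
        self.comparisons = 0  # key comparisons made by get(), for illustration

    def set(self, key: Hashable, value: Any) -> None:
        # Overwrite an existing key if present, otherwise append.
        for i, (k, _) in enumerate(self._pairs):
            if k == key:
                self._pairs[i] = (key, value)
                return
        self._pairs.append((key, value))

    def get(self, key: Hashable) -> Any:
        # Linear scan: compare against each stored key until one matches.
        for k, v in self._pairs:
            self.comparisons += 1
            if k == key:
                return v
        raise KeyError(key)


def worst_case_comparisons(n: int) -> int:
    """Count comparisons needed to find the last-inserted of n keys."""
    m = PairListMap()
    for i in range(n):
        m.set(i, i * i)
    m.comparisons = 0
    m.get(n - 1)
    return m.comparisons


if __name__ == "__main__":
    for n in (10, 1_000):
        print(n, worst_case_comparisons(n))

    # The hashed map gives the same answers; its lookup cost does not
    # grow with the number of stored pairs on average.
    hashed = {i: i * i for i in range(1_000)}
    print(hashed[999])
```

Both structures return the same values. The pair list needs as many comparisons as there are stored pairs in the worst case, while the hashed lookup stays roughly constant on average as the map grows.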
