
454 points | nathan-barry | 1 comment
jaaustin:
To my knowledge, this connection was first noted in 2021 in https://arxiv.org/abs/2107.03006 (page 5). We wanted to do text diffusion where you corrupt words into semantically similar words (like "quick brown fox" -> "speedy black dog"), but we kept finding that masking was the corruption the model found easiest to undo. Historically this goes back even further, to https://arxiv.org/abs/1904.09324, which built a generative MLM without framing it in diffusion math.
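
For intuition, here's a minimal sketch (my own illustration, not code from either paper) of the masking-style forward corruption, assuming a simple linear schedule where each token is independently masked with probability t/T at noise level t:

  import random

  MASK = "[MASK]"

  def corrupt(tokens, t, T):
      # Absorbing-state ("masking") forward process: at noise level t of T,
      # each token has independently been replaced by [MASK] with
      # probability t / T. At t = T the sequence is fully masked.
      p = t / T
      return [MASK if random.random() < p else tok for tok in tokens]

  tokens = "the quick brown fox jumps over the lazy dog".split()
  for t in (2, 5, 8):
      print(t, corrupt(tokens, t, T=10))

A denoising model trained to fill the masked positions back in, conditioned on the noise level, is essentially a BERT-style MLM, which is the connection being discussed.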
axiom92:
Yeah, that's the first formal reference I remember as well (although BERT is probably the first thing NLP folks will think of after reading about text diffusion).

I collected a few other early text-diffusion references here about three years ago: https://github.com/madaan/minimal-text-diffusion?tab=readme-....