
454 points nathan-barry | 4 comments
1. jaaustin No.45644944
To my knowledge this connection was first noted in 2021 in https://arxiv.org/abs/2107.03006 (page 5). We wanted to do text diffusion where you’d corrupt words to semantically similar words (like “quick brown fox” -> “speedy black dog”) but kept finding that masking was easier for the model to uncover. Historically this goes back even further to https://arxiv.org/abs/1904.09324, which made a generative MLM without framing it in diffusion math.
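
Concretely, the two corruption kernels being compared look roughly like this (a toy sketch; the synonym table and corruption rate are invented for illustration and are not from the paper):

    import random

    MASK = "[MASK]"

    # Hypothetical synonym table standing in for "semantically similar" corruption;
    # the real thing would come from embeddings or a thesaurus, not a hand-written dict.
    SYNONYMS = {
        "quick": ["speedy", "fast"],
        "brown": ["black", "tan"],
        "fox": ["dog", "wolf"],
    }

    def mask_corrupt(tokens, t):
        """Absorbing-state corruption: each token becomes [MASK] with probability t."""
        return [MASK if random.random() < t else tok for tok in tokens]

    def semantic_corrupt(tokens, t):
        """Substitution corruption: each token hops to a nearby word with probability t."""
        return [random.choice(SYNONYMS.get(tok, [tok])) if random.random() < t else tok
                for tok in tokens]

    if __name__ == "__main__":
        sent = ["quick", "brown", "fox"]
        print(mask_corrupt(sent, 0.7))      # e.g. ['[MASK]', 'brown', '[MASK]']
        print(semantic_corrupt(sent, 0.7))  # e.g. ['speedy', 'brown', 'dog']

One intuition for why the mask kernel trains more easily: the model can always see exactly which positions were corrupted, whereas with substitution it also has to figure out which words to change before deciding what to change them to.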
2. loubbrad No.45645010
Also relevant: https://arxiv.org/pdf/1902.04094
3. axiom92 No.45647380
Yeah, that's the first formal reference I remember as well (although BERT is probably the first thing NLP folks will think of after reading about diffusion).

I collected a few other early text-diffusion references here about 3 years ago: https://github.com/madaan/minimal-text-diffusion?tab=readme-....

4. koningrobot No.45648253
It goes further back than that. In 2014, Li Yao et al. (https://arxiv.org/abs/1409.0585) drew an equivalence between autoregressive generative models (next-token prediction, roughly) and generative stochastic networks (denoising autoencoders, the predecessor to diffusion models). They argued that the parallel sampling style correctly approximates sequential sampling.

In my own work circa 2016 I used this approach in Counterpoint by Convolution (https://arxiv.org/abs/1903.07227), where we in turn argued that, despite being an approximation, it leads to better results. Sadly, since it was dressed up as an application paper, we weren't able to draw enough attention to get those sweet diffusion citations.
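
For readers who haven't seen it, the procedure in question trains a model to fill in arbitrarily blanked-out positions, then generates by repeatedly blanking a random subset of positions and resampling them all in parallel from the model, rather than committing to a single left-to-right order. A toy sketch in Python, assuming a hypothetical fill_in(seq, positions) model call (this is not the actual Coconet code):

    import random

    def blocked_gibbs_sample(fill_in, length, vocab, steps=50):
        """Generate a sequence by repeatedly resampling blanked-out positions.

        fill_in(seq, positions) stands in for a trained orderless/inpainting
        model: it returns one sampled token per blanked position, conditioned
        on the rest of seq. (Hypothetical interface, not a real API.)
        """
        # Start from noise: every position gets a random token.
        seq = [random.choice(vocab) for _ in range(length)]
        for step in range(steps):
            # Blank a shrinking random block of positions as sampling proceeds.
            frac = 0.5 * (1.0 - step / steps)
            k = max(1, int(frac * length))
            positions = random.sample(range(length), k)
            # Parallel (approximate) resampling: the blanked positions are filled
            # jointly given the context, not one after another given each other.
            new_tokens = fill_in(seq, positions)
            for pos, tok in zip(positions, new_tokens):
                seq[pos] = tok
        return seq

    if __name__ == "__main__":
        vocab = ["C", "D", "E", "F", "G", "A", "B"]
        # Dummy stand-in model: ignores context; a trained model would condition on seq.
        dummy_fill_in = lambda seq, positions: [random.choice(vocab) for _ in positions]
        print(blocked_gibbs_sample(dummy_fill_in, length=16, vocab=vocab))

The approximation is that the blanked positions are sampled independently given the surrounding context rather than sequentially given each other; the argument in the papers above is that iterating this enough times still yields good (and sometimes better) samples.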

Pretty sure it goes further back than that still.