
454 points | nathan-barry | 1 comment
jaaustin ◴[] No.45644944[source]
To my knowledge, this connection was first noted in 2021 in https://arxiv.org/abs/2107.03006 (page 5). We wanted to do text diffusion where you’d corrupt words into semantically similar words (like “quick brown fox” -> “speedy black dog”), but we kept finding that masking corruption was easier for the model to undo. Historically this goes back even further, to https://arxiv.org/abs/1904.09324, which built a generative MLM without framing it in diffusion math.
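
To make the two corruption processes concrete, here is a rough Python sketch (my own illustration, not code from the paper; the synonym table and function names are made up). A masked token announces exactly which positions were corrupted, while a substituted word looks like any other word, which is at least part of why masking is easier for the model to undo.

    import random

    MASK = "[MASK]"
    # Toy synonym table, purely for illustration.
    SYNONYMS = {"quick": ["speedy", "fast"], "brown": ["black", "grey"], "fox": ["dog", "wolf"]}

    def corrupt_by_masking(tokens, rate):
        """Independently replace each token with [MASK] with probability `rate`."""
        return [MASK if random.random() < rate else t for t in tokens]

    def corrupt_by_substitution(tokens, rate):
        """Independently replace each token with a semantically similar word."""
        return [random.choice(SYNONYMS.get(t, [t])) if random.random() < rate else t
                for t in tokens]

    tokens = "quick brown fox".split()
    print(corrupt_by_masking(tokens, 0.7))       # e.g. ['[MASK]', 'brown', '[MASK]']
    print(corrupt_by_substitution(tokens, 0.7))  # e.g. ['speedy', 'brown', 'dog']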
replies(3): >>45645010 #>>45647380 #>>45648253 #
1. koningrobot ◴[] No.45648253[source]
It goes further back than that. In 2014, Li Yao et al. (https://arxiv.org/abs/1409.0585) drew an equivalence between autoregressive generative models (next-token prediction, roughly) and generative stochastic networks (denoising autoencoders, the predecessors of diffusion models). They argued that the parallel sampling style correctly approximates sequential sampling.
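
As a toy illustration of the two sampling styles being related (my own sketch, not code from Yao et al.; `toy_conditionals` stands in for any model that returns a per-position distribution over tokens given the rest of the sequence, e.g. a masked LM): sequential sampling re-conditions after every draw, while parallel sampling draws all missing positions at once from conditionals computed on the current sequence, then iterates.

    import random

    VOCAB = ["a", "b", "c"]

    def toy_conditionals(x, masked):
        # Hypothetical stand-in model: uniform over the vocabulary, ignoring context.
        # A real model (e.g. a masked LM) would condition on the filled-in tokens in x.
        return {i: {t: 1.0 / len(VOCAB) for t in VOCAB} for i in masked}

    def sequential_sample(conditionals, x, masked):
        """Fill missing positions one at a time, re-conditioning after each draw."""
        x, masked = list(x), set(masked)
        while masked:
            i = min(masked)  # fixed order for simplicity; orderless models pick positions at random
            dist = conditionals(x, masked)[i]
            x[i] = random.choices(list(dist), weights=list(dist.values()))[0]
            masked.remove(i)  # position i is now observed by later draws
        return x

    def parallel_sample(conditionals, x, masked, steps=5):
        """Draw every missing position at once from conditionals computed on the
        current sequence, then repeat; each step approximates many sequential draws."""
        x, masked = list(x), set(masked)
        for _ in range(steps):
            dists = conditionals(x, masked)  # computed before any position is updated
            for i, dist in dists.items():
                x[i] = random.choices(list(dist), weights=list(dist.values()))[0]
        return x

    x0 = ["a", None, None, "c"]
    print(sequential_sample(toy_conditionals, x0, {1, 2}))
    print(parallel_sample(toy_conditionals, x0, {1, 2}))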

In my own work circa 2016, I used this approach in Counterpoint by Convolution (https://arxiv.org/abs/1903.07227), where we in turn argued that, despite being an approximation, it leads to better results. Sadly, since the work was dressed up as an application paper, we weren't able to draw enough attention to get those sweet diffusion citations.

Pretty sure it goes further back than that still.