In their tech report, they say this is based on:
> "Our methods extend [28] through careful modifications to the data and computation to scale up learning."
[28] is Lou et al. (2023), the "Score Entropy Discrete Diffusion" (SEDD) model (https://arxiv.org/abs/2310.16834).
I wrote the first (as far as I can tell) independent from-scratch reimplementation of SEDD:
https://github.com/mstarodub/dllm
My goal was to make it as clean and readable as possible. I also implemented the more complex denoising strategy that Lou et al. describe in the paper but did not implement themselves.
It runs on a single GPU in a few hours on a toy dataset.
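To give a flavor of what sampling from this family of models looks like, here is a minimal toy sketch of reverse sampling for an absorbing-state ("mask") discrete diffusion process, the setting SEDD operates in. Everything here is illustrative and simplified: `dummy_model` is a hypothetical stand-in for the learned score network, and the unmasking schedule is a simple uniform one, not the actual SEDD sampler.

```python
import random

# Toy constants (assumptions for illustration, not SEDD's real config)
VOCAB = 10      # toy vocabulary size
MASK = -1       # sentinel for the absorbing/mask token
SEQ_LEN = 8
STEPS = 16

def dummy_model(x, t):
    # Hypothetical stand-in for the learned network: in SEDD a transformer
    # predicts score ratios; here we just return uniform probabilities.
    return [[1.0 / VOCAB] * VOCAB for _ in x]

def sample(steps=STEPS, seed=0):
    rng = random.Random(seed)
    # Start from the fully masked sequence (the absorbing prior)
    x = [MASK] * SEQ_LEN
    for i in range(steps):
        t = 1.0 - i / steps
        probs = dummy_model(x, t)
        # Unmask each remaining masked position with a probability chosen
        # so that every position is revealed by the final step
        p_unmask = 1.0 / (steps - i)
        for j in range(SEQ_LEN):
            if x[j] == MASK and rng.random() < p_unmask:
                x[j] = rng.choices(range(VOCAB), weights=probs[j])[0]
    return x

print(sample())
```

The key structural point this sketch shows is that, unlike autoregressive decoding, positions are filled in over many parallel denoising steps rather than strictly left to right.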