In their tech report, they say this is based on:
> "Our methods extend [28] through careful modifications to the data and computation to scale up learning."
[28] is Lou et al. (2023), the "Score Entropy Discrete Diffusion" (SEDD) model (https://arxiv.org/abs/2310.16834).
I wrote the first (as far as I can tell) independent from-scratch reimplementation of SEDD:
https://github.com/mstarodub/dllm
My goal was to make it as clean and readable as possible. I also implemented the more complex denoising strategy that Lou et al. describe in the paper but did not implement themselves.
It runs on a single GPU in a few hours on a toy dataset.
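To give a flavor of what sampling from this family of models looks like, here is a minimal toy sketch of reverse sampling for an absorbing-state ("mask") discrete diffusion process, the setting SEDD operates in. Everything here is illustrative and simplified: `dummy_model` is a hypothetical stand-in for the learned score network, and the unmasking schedule is a simple uniform one, not the actual SEDD sampler.

```python
import random

# Toy constants (assumptions for illustration, not SEDD's real config)
VOCAB = 10      # toy vocabulary size
MASK = -1       # sentinel for the absorbing/mask token
SEQ_LEN = 8
STEPS = 16

def dummy_model(x, t):
    # Hypothetical stand-in for the learned network: in SEDD a transformer
    # predicts score ratios; here we just return uniform probabilities.
    return [[1.0 / VOCAB] * VOCAB for _ in x]

def sample(steps=STEPS, seed=0):
    rng = random.Random(seed)
    # Start from the fully masked sequence (the absorbing prior)
    x = [MASK] * SEQ_LEN
    for i in range(steps):
        t = 1.0 - i / steps
        probs = dummy_model(x, t)
        # Unmask each remaining masked position with a probability chosen
        # so that every position is revealed by the final step
        p_unmask = 1.0 / (steps - i)
        for j in range(SEQ_LEN):
            if x[j] == MASK and rng.random() < p_unmask:
                x[j] = rng.choices(range(VOCAB), weights=probs[j])[0]
    return x

print(sample())
```

The key structural point this sketch shows is that, unlike autoregressive decoding, positions are filled in over many parallel denoising steps rather than strictly left to right.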