I'm more excited about approaches like this one:
https://openreview.net/forum?id=c05qIG1Z2B
They're doing continuous latent diffusion combined with autoregressive transformer-based text generation. The autoencoder and transformer are (or can be) trained in tandem.
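To make the shape of the idea concrete, here's a toy numpy sketch of that kind of joint objective. This is not the paper's actual method or architecture; every name, dimension, and the linear maps standing in for the encoder, denoiser, and transformer decoder are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; the paper's real dimensions differ).
vocab, d_model, d_latent, seq_len = 50, 16, 8, 10

# --- Autoencoder: token embeddings -> continuous latent -> embeddings ---
# Linear maps stand in for the real encoder/decoder networks.
embed = rng.normal(0, 0.1, (vocab, d_model))
W_enc = rng.normal(0, 0.1, (seq_len * d_model, d_latent))
W_dec = rng.normal(0, 0.1, (d_latent, seq_len * d_model))

tokens = rng.integers(0, vocab, seq_len)
x = embed[tokens].reshape(-1)              # flattened sequence embeddings
z = x @ W_enc                              # continuous latent

# --- Diffusion in latent space: noise the latent, predict the noise ---
t = rng.uniform(0.01, 1.0)                 # random noise level
eps = rng.normal(size=d_latent)
z_noisy = np.sqrt(1 - t) * z + np.sqrt(t) * eps
W_denoise = rng.normal(0, 0.1, (d_latent, d_latent))
eps_hat = z_noisy @ W_denoise              # stand-in for the denoiser
diffusion_loss = np.mean((eps_hat - eps) ** 2)

# --- Decoding: reconstruct embeddings from the latent ---
# (In the paper this is an autoregressive transformer conditioned on z;
# here it's a linear map so the sketch stays tiny.)
x_hat = z @ W_dec
recon_loss = np.mean((x_hat - x) ** 2)

# Joint objective: one loss trains the autoencoder and denoiser in tandem.
loss = recon_loss + diffusion_loss
print(f"recon={recon_loss:.4f} diffusion={diffusion_loss:.4f} total={loss:.4f}")
```

The point of the tandem setup is that a single backward pass through `loss` shapes the latent space to be both reconstructable and easy to denoise, rather than freezing a pretrained autoencoder first.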