The problem with this approach to text generation is that it's still not flexible enough. If, during inference, the model changes its mind and wants to output something considerably different, it can't, because too many tokens are already locked in place.
That's not true. If you look at the first gif animation in the OP, you can see that tokens disappear; the only part that stays untouched is the prompt. Adding noise back is part of the diffusion process, and the code that does it is even posted in the article (ctrl+f "def diffusion_collator").
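To make the point concrete, here's a minimal sketch of what that kind of collator does: non-prompt tokens get re-masked with some probability so the model can revise them on later denoising steps, while the prompt is never touched. The names (`MASK_ID`, `diffusion_collate`) and the uniform masking schedule are my assumptions for illustration, not the article's actual code.

```python
import random

MASK_ID = 0  # hypothetical mask-token id (assumption, not from the article)

def diffusion_collate(token_ids, prompt_len, t):
    """Re-noise a sequence for one diffusion step.

    Every token after the prompt is replaced with MASK_ID with
    probability t, so previously generated tokens can 'disappear'
    and be re-predicted. The first prompt_len tokens are left intact.
    """
    out = list(token_ids)
    for i in range(prompt_len, len(out)):
        if random.random() < t:
            out[i] = MASK_ID
    return out

seq = [11, 12, 13, 21, 22, 23, 24]   # first 3 tokens are the prompt
noised = diffusion_collate(seq, prompt_len=3, t=1.0)
# with t=1.0 every generated token is masked; the prompt survives as-is
```

In a real training collator `t` would be sampled per example from the noise schedule, but the key behavior is the same: generated tokens are fair game for masking, the prompt isn't.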