It feels like it would make more sense to allow the model to do Levenshtein-like edits instead of just masking and filling in the masked tokens. It seems that intuitively it's really hard in this diffusion setup to just swap one word with a longer but better synonym towards the end, because there's no way to shift everything to the right afterwards.
replies(1):