Thank you for this informative and thoughtful post. An interesting twist on the error accumulation that autoregressive models suffer as outputs grow longer is the recent success of language diffusion models, which predict multiple tokens simultaneously. They apply a remasking strategy at every step of the reverse process, masking low-confidence tokens so they can be re-predicted. Regardless, your observations may still apply. https://arxiv.org/pdf/2502.09992
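To make the remasking idea concrete, here's a toy sketch (not the paper's actual algorithm; the `predict` function is a hypothetical stand-in for a real diffusion language model): at each reverse step, every masked slot gets a prediction, but only the most confident fraction is kept, while the rest are remasked for the next step.

```python
import random

MASK = "<mask>"

def predict(tokens):
    # Stand-in "model": invent a token with a random confidence for each
    # masked slot; already-filled slots keep their token at confidence 1.0.
    return [(t, 1.0) if t != MASK else (f"tok{i}", random.random())
            for i, t in enumerate(tokens)]

def remask_step(tokens, keep_ratio):
    """One reverse-diffusion step: predict every masked slot, then
    remask the lowest-confidence predictions for the next step."""
    preds = predict(tokens)
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    # Keep only the most confident fraction of the fresh predictions.
    n_keep = max(1, int(len(masked) * keep_ratio)) if masked else 0
    by_conf = sorted(masked, key=lambda i: preds[i][1], reverse=True)
    keep = set(by_conf[:n_keep])
    return [preds[i][0] if (t != MASK or i in keep) else MASK
            for i, t in enumerate(tokens)]

seq = [MASK] * 8
while MASK in seq:
    seq = remask_step(seq, keep_ratio=0.5)
print(seq)  # all slots filled, highest-confidence predictions locked in first
```

The point of the sketch is just the control flow: tokens are committed in confidence order across several parallel steps, rather than strictly left to right, which is why the usual autoregressive error-compounding argument changes shape here.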