I think Yann is right if all you do is output a token that depends only on the previous token. If it's a simple Markov chain, sure, errors will eventually compound.
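(Rough sketch of the compounding argument, with ε as an assumed independent per-token error rate, not a figure from Yann's talk:)

$$P(\text{all } n \text{ tokens correct}) = (1 - \varepsilon)^n \approx e^{-\varepsilon n},$$

which decays exponentially in sequence length if errors are never corrected downstream.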
But with the attention mechanism, the output token depends not only on the previous one but on all 1 million previous ones (assuming a 1M context window). This gives the model plenty of opportunity to fix its errors (hence the "aha moments").
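For concreteness, here's a minimal NumPy sketch of causal self-attention (toy code, no learned projections, not any particular model's implementation) where the output at position t is a weighted mix of every earlier token, not just the immediately preceding one:

```python
import numpy as np

def causal_attention(x):
    """x: (seq_len, d) token representations; returns (seq_len, d) outputs."""
    seq_len, d = x.shape
    # Use x itself as queries/keys/values for simplicity (no learned projections).
    scores = x @ x.T / np.sqrt(d)                        # (seq_len, seq_len) similarity
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -np.inf                             # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over all past positions
    return weights @ x                                   # each output mixes the full prefix

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                             # 8 tokens, 16-dim embeddings
out = causal_attention(x)
print(out.shape)  # (8, 16): row t depends on rows 0..t, not only row t-1
```

A Markov-style model would compute row t from row t-1 alone, which is where the error-compounding picture comes from; with the full prefix in view, a later step can down-weight an earlier mistake.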