Autoregressive LLMs don't do that either, actually. Sure, one forward pass only gives you one token at a time, but if you look at what is happening in the latent space there are clear signs of long-term planning and reasoning that go beyond just the next token.
So I don't think it's necessarily more or less similar to us than diffusion. We also say one word at a time, sequentially, even when we have the bigger picture in mind.
Obviously wrong: otherwise every model would predict exactly the same thing, and it would not even be predicting anymore, simply decoding.
The sequence is not enough to reproduce the exact output, you also need the weights.
And the way the model works is by attending to its own internal state (weights*input) and refining it, both across the depth (layer) dimension and across the time (token) dimension.
The fact that you can get the model to give you the exact same output by fixing a few seeds is only a consequence of the process being Markovian. It is orthogonal to the fact that at each token position the model is “thinking” about a longer horizon than the present token and is able to reuse that representation at later time steps.
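To make the seed point concrete, here is a toy sketch in Python. It is not a real LLM: `make_weights`, `generate`, and the trick of picking a score row from the sum of the context are all made up for illustration. The only thing it is meant to show is that reproducing an output takes the sequence, the weights, and the sampling seed together.

```python
import random


def make_weights(weight_seed, vocab_size=5):
    """Hypothetical stand-in for model weights: one row of scores per context state."""
    rng = random.Random(weight_seed)
    return [[rng.random() for _ in range(vocab_size)] for _ in range(vocab_size)]


def generate(weights, prompt, n_tokens, sampling_seed):
    """Sample n_tokens one at a time; each step conditions on the whole sequence so far."""
    rng = random.Random(sampling_seed)
    seq = list(prompt)
    vocab = range(len(weights))
    for _ in range(n_tokens):
        # Toy "state": a function of the full context, not just the last token.
        scores = weights[sum(seq) % len(weights)]
        seq.append(rng.choices(vocab, weights=scores, k=1)[0])
    return seq


weights_a = make_weights(weight_seed=0)
weights_b = make_weights(weight_seed=1)

run_1 = generate(weights_a, [0], 10, sampling_seed=42)
run_2 = generate(weights_a, [0], 10, sampling_seed=42)
run_3 = generate(weights_b, [0], 10, sampling_seed=42)

print(run_1 == run_2)  # same weights, prompt, and seed: identical by construction
print(run_1, run_3)    # same prompt and seed, different weights: compare by eye
```

Fixing the seed makes the first two runs identical, but that says nothing about what the scores encode. In a real model the interesting part is hidden inside the state that the toy `scores` lookup glosses over.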
And yes, LLMs can be studied through the lens of Markov processes: https://arxiv.org/pdf/2410.02724
Have a good day