Markov chains learn a fixed distribution, but transformers learn the distribution of distributions and latch onto what the current distribution based on evidence seen so far. So that's where the single shot learning comes from in transformer. Markov chains can't do that, they will not change the underlying distribution as they read.
replies(1):