"Here's the key insight: each forward pass processes ALL tokens in ALL sequences simultaneously."
This sounds incorrect, you only process all tokens once, and later incrementally. It's an auto-regressive model after all.
replies(1):
From that point on every subsequent tokens is processed sequentially in autoregressive way, but because we have the KV cache, this becomes O(N) (1 token query to all tokens) and not O(N^2)