I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.
With self-attention you compare every token in a sequence against every other token in that same sequence, figuring out which word references which other word (e.g. in "George is sitting in the park. He's reading a book.", "He" would correlate with "George", letting the model know what it refers to). Of course these are also trained layers, so what the model thinks correlates with what, and how that info is used in the feed-forward (perceptron) part of the DNN, depends wholly on the training process.
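A toy NumPy sketch of that "every token against every other token" comparison (shapes and numbers are made up for illustration; real models use learned projections and multiple heads):

```python
import numpy as np

N, d = 6, 8  # 6 tokens, 8-dim vectors (arbitrary toy sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(N, d))  # queries: "what is this token looking for?"
K = rng.normal(size=(N, d))  # keys: "what does this token offer?"
V = rng.normal(size=(N, d))  # values: the info that actually gets mixed

# Every token's query is scored against every token's key: an N x N matrix.
# This is the quadratic part -- N^2 pairwise comparisons.
scores = Q @ K.T / np.sqrt(d)                         # shape (N, N)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # softmax per row

out = weights @ V  # each token's output is a weighted mix of all values
print(out.shape)   # (6, 8)
```

The softmax rows sum to 1, so each output token is a weighted average over all tokens' values; which pairs get high weight is what training shapes.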
There is no free lunch with this: with only 1/4 of layers having it, the model will perform significantly worse at identifying relevant info and will likely decohere a lot compared to having it on every layer. But since you get rid of the quadratic complexity, it'll be much faster. Think of the "I'm doing 1000 calculations per second and they're all wrong" meme. So far there have been lots of attempts at doing linear-ish attention (e.g. Google's sliding-window hackery that only computes part of the score matrix and hopes for good locality, Mamba-style RNN hybrids, Meta removing positional encodings from attention in the trainwreck that was Llama 4, etc.) and they've mostly failed, so the holy grail is finding a way to make it work, since you'd get the best of both worlds. The top-performing models today all use fully quadratic attention, or combine it with sliding windows in some layers to claw back some speed in long-context scenarios at the cost of some accuracy.
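A rough sketch of the sliding-window idea mentioned above (toy NumPy version with an assumed window size; real implementations never materialize the full matrix, which is where the savings come from):

```python
import numpy as np

N, d, W = 8, 4, 2  # 8 tokens, window of 2 positions either side (toy numbers)
rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)

# Mask out token pairs more than W positions apart: each token only sees
# its local neighborhood, so effective cost is N*W instead of N*N.
idx = np.arange(N)
mask = np.abs(idx[:, None] - idx[None, :]) > W
scores[mask] = -np.inf  # exp(-inf) = 0, so these pairs get zero weight

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V
print(out.shape)  # (8, 4)
```

The "hopes for good locality" criticism is visible here: any dependency further than W tokens away (like "He" referring back to a name from several sentences ago) simply gets weight zero.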
- apply learned knowledge from its parameters to every part of the input representation ("tokenized", i.e., chunkified text).
- apply mixing of the input representation with other parts of itself. This is called "attention" for historical reasons. The original attention computes mixing of (roughly) every token (say N of them) with every other (N). Thus we pay a compute cost relative to N squared.
The attention cost therefore grows quickly in terms of compute and memory requirements when the input / conversation becomes long (or may even contain documents).
It is a very active field of research to reduce the quadratic part to something cheaper, but so far this has been rather difficult because, as you can readily see, it means giving up mixing every part of the input with every other.
Most of the time, mixing token representations that are close to each other matters more than mixing those far apart, but not always. That's why there are many attempts now to do away with most of the quadratic attention layers while keeping some.
What to do during mixing when you give up all-to-all attention is the big research question because many approaches seem to behave well only under some conditions and we haven’t established something as good and versatile as all-to-all attention.
If you forgo all-to-all you also open up many options (e.g. all-to-something followed by something-to-all as a pattern, where the "something" serves as a sort of memory or state that summarizes all inputs at once. You can imagine that summarizing all inputs well is a lossy abstraction, though).
For an input of length N (tokens), the standard kind of attention requires N squared operations (hence, quadratic - it scales with the square of input length). You have to check how every token attends to every other token.
There are a bunch of alternative mixing functions which are instead linear with respect to N. Every additional token costs the same amount of work. The typical method is to have a constant size state manipulated recurrently, which necessarily implies some level of lossy compression in the state (quadratic attention doesn't really have state in this sense - it computes and checks every possible relation always).
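A minimal sketch of that constant-size recurrent state (simplified; real linear-attention methods use learned feature maps, here a plain positive map is assumed for illustration):

```python
import numpy as np

def phi(x):
    # Assumed positive "feature map"; real methods (Performer, linear
    # transformers, etc.) use more principled choices.
    return np.maximum(x, 0) + 1e-6

N, d = 6, 8  # toy sizes
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

S = np.zeros((d, d))  # the constant-size state: running sum of outer(k, v)
z = np.zeros(d)       # running normalizer: sum of feature-mapped keys
outs = []
for q, k, v in zip(Q, K, V):
    S += np.outer(phi(k), v)                 # fold this token into the state
    z += phi(k)
    outs.append(phi(q) @ S / (phi(q) @ z))   # O(d^2) per token, O(N) total
out = np.array(outs)
print(out.shape)  # (6, 8)
```

Note the trade-off the comment describes: S has fixed size (d x d) no matter how long the input gets, so everything seen so far is squeezed into it, which is exactly where the lossy compression comes from.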
Linear attentions kind of suck in comparison to quadratic attention but the efficiency is very attractive, especially at inference time where you don't need more VRAM to store more context.
TL;DR: conventional attention scales as N^2 in time and N in space (KV cache), and is exact; linear attention scales as N in time and constant in space (recurrent state), and is lossy.
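To make the space part of that TL;DR concrete, here are back-of-the-envelope element counts (toy numbers for a single head; real models multiply this by heads and layers):

```python
d = 128  # assumed per-head dimension

for N in (1_000, 10_000, 100_000):
    kv_cache = 2 * N * d       # quadratic attention: keep keys+values per token
    linear_state = d * d + d   # linear attention: fixed state matrix + normalizer
    print(f"N={N}: kv_cache={kv_cache}, linear_state={linear_state}")
```

The KV cache grows linearly with context length, while the recurrent state stays the same size at 100 tokens or 100k, which is why linear attention is so attractive for long-context inference.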