I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.
- apply learned knowledge from its parameters to every part of the input representation ("tokenized", i.e., chunked text).
- mix the input representation with other parts of itself. This is called "attention" for historical reasons. The original attention mixes (roughly) each of the N tokens with every other token, so the compute cost is proportional to N squared.
The compute and memory cost of attention therefore grows quickly when the input or conversation becomes long (for example, when it contains whole documents).
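To make the N squared part concrete, here is a minimal pure-Python sketch of all-to-all mixing. It is a simplification, not how any real model is implemented: the function name `full_attention` is made up for illustration, and queries, keys and values are assumed to be the raw token vectors (a real layer would apply learned projections first).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def full_attention(tokens):
    """All-to-all mixing: every token vector attends to every token vector.

    Assumption for illustration: no learned projections, the token vectors
    themselves act as queries, keys and values.
    Returns the mixed vectors and the number of pairwise scores computed.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    mixed, pair_count = [], 0
    for q in tokens:
        # One score per (query, key) pair: N scores for each of N queries.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        pair_count += len(scores)
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed.append([sum(w * v[j] for w, v in zip(weights, tokens))
                      for j in range(d)])
    return mixed, pair_count
```

Calling it on 3 tokens computes 9 scores and on 6 tokens 36, so doubling the input length quadruples the work.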
Reducing the quadratic part to something cheaper is a very active field of research, but so far it has been rather difficult because, as you can readily see, it means giving up mixing every part of the input with every other.
Most of the time, mixing token representations that are close to each other is more important than mixing those that are far apart, but not always. That’s why there are now many attempts to do away with most of the quadratic attention layers while keeping some.
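One way to picture "only mix nearby tokens" is a sliding window, sketched below in pure Python. This is just one illustrative variant, not a specific published method: the name `local_attention` and the `window` parameter are assumptions, and as before there are no learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention(tokens, window):
    """Each token attends only to tokens at most `window` positions away.

    Assumption for illustration: token vectors act as queries, keys and
    values. Returns the mixed vectors and the number of scores computed.
    """
    n, d = len(tokens), len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    mixed, pair_count = [], 0
    for i, q in enumerate(tokens):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = tokens[lo:hi]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in neighbours]
        pair_count += len(scores)
        weights = softmax(scores)
        mixed.append([sum(w * v[j] for w, v in zip(weights, neighbours))
                      for j in range(d)])
    return mixed, pair_count
```

The cost is now roughly N times the window size instead of N squared, but token 0 never sees anything beyond its window, which is exactly the "but not always" problem.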
What to do during mixing once you give up all-to-all attention is the big research question, because many approaches seem to behave well only under some conditions, and we haven’t yet established anything as good and versatile as all-to-all attention.
If you forgo all-to-all, you also open up many options, e.g. all-to-something followed by something-to-all as a pattern, where the "something" serves as a sort of memory or state that summarizes all inputs at once. You can imagine, though, that summarizing all inputs well is a lossy abstraction, and so on.
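The all-to-something, something-to-all pattern can be sketched like this. It is a rough illustration under stated assumptions, not any particular architecture: `attend`, `bottleneck_attention` and the `memory` slots are hypothetical names, the memory vectors would normally be learned, and there are no learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys):
    """Each query becomes a weighted average of `keys` (keys double as values)."""
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    out, pair_count = [], 0
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        pair_count += len(scores)
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys))
                    for j in range(d)])
    return out, pair_count

def bottleneck_attention(tokens, memory):
    """Mix N tokens through M memory slots instead of with each other."""
    # All-to-something: each memory slot summarizes all tokens (M x N scores).
    summary, p1 = attend(memory, tokens)
    # Something-to-all: each token reads from the summary (N x M scores).
    mixed, p2 = attend(tokens, summary)
    return mixed, p1 + p2
```

With N tokens and M slots this computes 2 * N * M scores rather than N * N, so it is cheaper whenever M is much smaller than N. The price is that every token only sees the rest of the input through M summary vectors, which is where the lossiness comes from.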