
210 points | blackcat201 | 1 comment
eXpl0it3r (No.45769733)
For the uninitiated, what's a "hybrid linear attention architecture"?
replies(2): >>45769822, >>45772001
quotemstr (No.45769822)
1/4 of their layers use conventional quadratic attention; the rest use a linear attention variant.
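
To make that concrete, here's a toy sketch of the interleaving (the exact pattern is my guess, not necessarily this model's):

    # Hypothetical hybrid stack: every 4th layer is full (quadratic)
    # attention, the rest are linear. Real models vary the pattern.
    def layer_types(n_layers, full_every=4):
        return ["full" if (i + 1) % full_every == 0 else "linear"
                for i in range(n_layers)]

    print(layer_types(8))
    # ['linear', 'linear', 'linear', 'full',
    #  'linear', 'linear', 'linear', 'full']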
replies(1): >>45770016
meowface (No.45770016)
Could someone explain every term in this subthread in a very simple way to someone who basically only knows "transformers are a neural network architecture that uses something called 'attention' to consider the entire input the whole time, or something like that", and who does not understand what "quadratic" even means in a time-complexity or mathematical sense beyond that "quad" has something to do with the number four?

I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.

replies(3): >>45770266, >>45771250, >>45780732
moffkalast (No.45770266)
Afaik there are two types of attention: cross-attention and self-attention. It's quadratic because you have to score one set of tokens against another, like computing a matrix product: n tokens against m tokens means n × m comparisons, so when both sequences grow together the work grows with the square of the length (that's all "quadratic" means here). Attention was originally designed for translation: you'd take tokens in one language on one side and tokens in the other language on the other, then compute the relevance of each word to every other word, which the model then uses to generate the translation more accurately.
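
Here's a minimal numpy sketch of single-head scaled dot-product attention — not any particular model's implementation; real layers add learned Q/K/V projections, multiple heads, and masking:

    import numpy as np

    def attention(Q, K, V):
        # Q: (n, d) queries; K, V: (m, d) keys/values.
        # scores is an (n, m) matrix: every query scored against every
        # key -- this is the part that's quadratic when n == m.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        # softmax over keys turns scores into attention weights
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V  # weighted mix of values, shape (n, d)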

With self-attention you score every token in a sequence against every other token in that same sequence, figuring out which word references which other word (e.g. "George is sitting in the park. He's reading a book." — "He" would correlate with "George", letting the model know what it refers to). Of course these are also trained layers, so what the model thinks correlates with what, and how that info is used in the feed-forward (perceptron) part of the network, depends wholly on the training process.
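
Self-attention is then just the same function with all three inputs coming from one sequence (continuing the sketch above):

    # 6 tokens with 16-dim embeddings; in a real model Q, K, V would be
    # learned linear projections of x rather than x itself.
    x = np.random.randn(6, 16)
    out = attention(x, x, x)  # score matrix is 6 x 6 -> cost grows as n^2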

There is no free lunch with this: with only 1/4 of layers having full attention, the model will likely be significantly worse at identifying relevant info and will decohere more than one with it in every layer. But since you get rid of the quadratic complexity, it'll be much faster. Think of the "I'm doing 1000 calculations per second and they're all wrong" meme.

So far there have been lots of attempts at linear-ish attention (e.g. Google's sliding-window hackery that only computes part of the score matrix and hopes for good locality, hybrids of attention with RNN-like Mamba layers, Meta removing positional encodings from attention in the trainwreck that was Llama 4, etc.) and they've mostly failed, so the holy grail is finding a way to make it work, since you'd get the best of both worlds. The top-performing models today all use fully quadratic attention, or combine it with sliding windows in some layers to claw back some speed in long-context scenarios at the cost of some accuracy; a rough sketch of that trick is below.
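
For intuition, a toy numpy sketch of the sliding-window idea (naming is mine, purely illustrative — real kernels never materialize the full n × n matrix):

    import numpy as np

    def sliding_window_mask(n, w):
        # Each token attends only to itself and the previous w-1 tokens,
        # so work grows as n*w instead of n^2.
        i = np.arange(n)[:, None]  # query positions
        j = np.arange(n)[None, :]  # key positions
        return (j <= i) & (j > i - w)  # causal band of width w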