210 points by blackcat201

eXpl0it3r:
For the uninitiated, what's a "hybrid linear attention architecture"?

quotemstr:
1/4 of their layers are conventional quadratic attention; the rest use a linear-attention variant.
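
A minimal sketch of what that kind of interleaving could look like (my illustration, not the model's actual code; the layer count is made up and the 1-in-4 ratio is just taken from this comment):

    # Hypothetical hybrid stack: every 4th block uses full (quadratic)
    # softmax attention, the other blocks use a linear-attention variant.
    def build_hybrid_stack(num_layers=32):
        return ["full_softmax_attention" if (i + 1) % 4 == 0 else "linear_attention"
                for i in range(num_layers)]

    build_hybrid_stack(8)
    # ['linear_attention', 'linear_attention', 'linear_attention', 'full_softmax_attention',
    #  'linear_attention', 'linear_attention', 'linear_attention', 'full_softmax_attention']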

meowface:
Could someone explain every term in this subthread in a very simple way to someone who basically only knows "transformers are a neural network architecture that use something called 'attention' to consider the entire input the whole time or something like that", and who does not understand what "quadratic" even means in a time complexity or mathematical sense beyond that "quad" has something to do with the number four.

I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.

hexaga:
There are different varieties of attention, each of which amounts to some kind of learned mixing function between the tokens in a sequence.

For an input of N tokens, the standard kind of attention requires on the order of N squared operations (hence "quadratic": it scales with the square of the input length), because you have to check how every token attends to every other token.
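
A minimal numpy sketch of that (single head, no causal mask), just to show where the N x N comes from:

    import numpy as np

    def softmax_attention(Q, K, V):
        # Q, K, V: float arrays of shape (N, d). The score matrix is (N, N):
        # every token is compared against every other token, which is where
        # the quadratic cost in both time and memory comes from.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (N, N) pairwise scores
        scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
        return weights @ V                               # (N, d) mixed values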

There are a bunch of alternative mixing functions that are instead linear in N: every additional token costs the same amount of work. The typical approach is to keep a constant-size state that is updated recurrently, which necessarily implies some lossy compression in that state (quadratic attention doesn't really have state in this sense: it computes and checks every possible pairwise relation every time).
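
A sketch of one flavor of this, run recurrently (the kernelized variety; the feature map phi below is an arbitrary choice for illustration, not what any particular model uses):

    import numpy as np

    def linear_attention(Q, K, V):
        # Q, K, V: float arrays of shape (N, d). Instead of an (N, N) score
        # matrix we carry a constant-size state S (d x d) plus a normalizer
        # z (d,), updated once per token: O(N) time, O(1) memory in N.
        # The state is a lossy summary of everything seen so far.
        phi = lambda x: np.maximum(x, 0.0) + 1e-6   # simple positive feature map
        N, d = Q.shape
        S = np.zeros((d, d))
        z = np.zeros(d)
        out = np.zeros_like(V)
        for t in range(N):                          # causal, one token at a time
            k, q = phi(K[t]), phi(Q[t])
            S += np.outer(k, V[t])                  # fold this token's key/value into the state
            z += k
            out[t] = (q @ S) / (q @ z)              # read out with the current query
        return out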

Linear attentions kind of suck compared to quadratic attention, but the efficiency is very attractive, especially at inference time, where you don't need more VRAM to store more context.

TL;DR: conventional attention scales as N^2 in time and N in space (the KV cache) and is exact; linear attention scales as N in time and constant space (the recurrent state) and is lossy.