I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.
For an input of length N (tokens), the standard kind of attention requires on the order of N^2 operations (hence quadratic - it scales with the square of the input length). You have to check how every token attends to every other token.
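If it helps to see it, here's a rough sketch in plain NumPy (single head, no causal mask, function name is mine) - the point is just where the N x N matrix comes from:

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Q, K, V: (N, d) for a single head.
    # The scores matrix is (N, N): every token compared against every other
    # token. That matrix is the quadratic cost in sequence length.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                                  # (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                             # (N, d)
```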
There are a bunch of alternative mixing functions that are instead linear in N: every additional token costs the same amount of work. The typical method is to keep a constant-size state that gets updated recurrently, which necessarily implies some level of lossy compression in that state (quadratic attention doesn't really have state in this sense - it computes and checks every possible pairwise relation every time).
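A sketch of that recurrent flavor, again plain NumPy and my own naming - this is the generic linear-attention recurrence with an assumed ReLU-ish feature map phi; real methods differ a lot in the feature map, gating, decay, etc.:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Causal linear attention as a recurrence: a constant-size state S (d x d_v)
    # plus a normalizer z (d,) are updated once per token, so compute is linear
    # in N and memory stays constant no matter how long the context gets.
    N, d = Q.shape
    S = np.zeros((d, V.shape[-1]))   # running sum of phi(k) v^T
    z = np.zeros(d)                  # running sum of phi(k), for normalization
    out = np.zeros_like(V)
    for t in range(N):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)          # fold this token into the fixed-size state
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out
```

Everything before token t has been squashed into S and z, which is exactly the lossy compression mentioned above.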
Linear attentions kind of suck in comparison to quadratic attention, but the efficiency is very attractive, especially at inference time, where you don't need more VRAM to store more context.
TL;DR: conventional attention scales as N^2 in time and N in space (KV cache), and is exact. Linear attention scales as N in time and constant space (recurrent state), and is lossy.