
210 points blackcat201 | 4 comments
eXpl0it3r ◴[] No.45769733[source]
For the uninitiated, what's a "hybrid linear attention architecture"?
replies(2): >>45769822 #>>45772001 #
quotemstr ◴[] No.45769822[source]
1/4 of their layers are conventional quadratic attention
replies(1): >>45770016 #
1. meowface ◴[] No.45770016[source]
Could someone explain every term in this subthread in a very simple way to someone who basically only knows "transformers are a neural network architecture that use something called 'attention' to consider the entire input the whole time or something like that", and who does not understand what "quadratic" even means in a time complexity or mathematical sense beyond that "quad" has something to do with the number four.

I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.

replies(3): >>45770266 #>>45771250 #>>45780732 #
2. moffkalast ◴[] No.45770266[source]
Afaik there are two types of attention: cross-attention and self-attention. It's quadratic because you have to process one set of tokens against another, like calculating a matrix product. Cross-attention was originally designed for translation: you'd take the tokens in one language on one side and the tokens in the other language on the other, then compute the relevance of each word to each word on the other side, which the model then uses to generate the translation more accurately.

With self-attention you compare every token in a sequence with every other token in that same sequence, figuring out which word references which other word (e.g. in "George is sitting in the park. He's reading a book.", "He" would correlate with "George", letting the model know what it refers to). Of course these are also trained layers, so what the model thinks correlates with what, and how that info is used in the DNN perceptron part, depends wholly on the training process.
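A minimal sketch of that "every token against every token" step, in plain Python. Assumptions for illustration: the function names are made up, the toy vectors stand in for learned query/key/value projections, and there is a single head with no masking. The point is only that each of the N tokens produces N scores, so the weight table is N by N.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Return (outputs, weights); weights[i][j] is how much token i attends to token j."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # One score per (query token, key token) pair: N scores for each of N tokens.
        scores = [dot(q, k) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(w[j] * values[j][d] for j in range(len(values)))
                        for d in range(dim)])
    return outputs, weights

# Hypothetical 3-token input; a real model would use learned projections here.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, and the table has one entry per token pair, which is where the quadratic cost comes from.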

There is no free lunch with this, and with only 1/4 of the layers having it, the model will perform significantly worse at identifying relevant info and likely decohere a lot compared to having it on every layer. But since most layers get rid of the quadratic complexity, it'll be much faster. Think "I'm doing 1000 calculations per second and they're all wrong" meme. So far there have been lots of attempts at doing linear-ish attention (e.g. Google doing the sliding window hackery that only computes a part of the vectors and hopes for good locality, Mamba combinations with RNNs, Meta removing positional encodings in attention in the trainwreck that was Llama 4, etc.) and they've mostly failed, so the holy grail is finding a way to make it work, since you'd get the best of both worlds. The top performing models today all use fully quadratic attention, or combine it with sliding windows in some layers to claw back some speed in long context scenarios at the cost of some accuracy.
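To put rough numbers on the sliding window trade-off, here is a small counting sketch. It assumes causal attention (each token only looks backwards) and a hypothetical window size; the function names are invented for this example and it counts scored token pairs only, ignoring everything else a layer does.

```python
def full_causal_pairs(n):
    """Token i attends to tokens 0..i, so n*(n+1)/2 pairs are scored."""
    return n * (n + 1) // 2

def sliding_window_pairs(n, window):
    """Token i attends only to itself and the previous window-1 tokens."""
    return sum(min(i + 1, window) for i in range(n))

for n in (1000, 2000):
    print(n, full_causal_pairs(n), sliding_window_pairs(n, window=100))
```

Doubling the sequence roughly quadruples the full count but only about doubles the windowed one. The price is that a token can no longer look directly at anything older than the window.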

3. Zacharias030 ◴[] No.45771250[source]
Transformers try to give you capabilities by doing two things interleaved (in layers) multiple times:

- apply learned knowledge from their parameters to every part of the input representation ("tokenized", i.e. chunkified, text).

- apply mixing of the input representation with other parts of itself. This is called "attention" for historical reasons. The original attention computes mixing of (roughly) every token (say N) with every other (N). Thus we pay a compute cost proportional to N squared.

The attention cost therefore grows quickly in terms of compute and memory requirements when the input / conversation becomes long (or even contains whole documents).
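To make the growth concrete, here is a rough cost model. It is an assumption for illustration only: one score per token pair, each costing about d multiply-adds for vectors of width d, with constant factors and everything outside the score computation ignored.

```latex
\mathrm{cost}(N) \;\approx\; N^{2}\, d,
\qquad
\frac{\mathrm{cost}(2N)}{\mathrm{cost}(N)} \;=\; \frac{(2N)^{2}\, d}{N^{2}\, d} \;=\; 4 .
```

So under this model, going from 1,000 to 100,000 tokens (100x longer) means roughly 10,000x more pairwise scores per attention layer.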

It is a very active field of research to reduce the quadratic part to something cheaper, but so far this has been rather difficult, because as you readily see this means that you have to give up mixing every part of the input with every other.

Most of the time, mixing token representations that are close to each other is more important than mixing those that are far apart, but not always. That's why there are many attempts now to do away with most of the quadratic attention layers while keeping some.

What to do during mixing when you give up all-to-all attention is the big research question because many approaches seem to behave well only under some conditions and we haven’t established something as good and versatile as all-to-all attention.

If you forgo all-to-all you also open up many options, e.g. all-to-something followed by something-to-all as a pattern, where "something" serves as a sort of memory or state that summarizes all inputs at once. You can imagine, though, that summarizing all inputs well is a lossy abstraction.

4. hexaga ◴[] No.45780732[source]
There are different varieties of attention, which just amounts to some kind of learned mixing function between tokens in a sequence.

For an input of length N (tokens), the standard kind of attention requires N squared operations (hence, quadratic - it scales with the square of input length). You have to check how every token attends to every other token.

There are a bunch of alternative mixing functions which are instead linear with respect to N: every additional token costs the same amount of work. The typical method is to have a constant-size state that is updated recurrently, which necessarily implies some level of lossy compression in the state (quadratic attention doesn't really have state in this sense; it always computes and checks every possible relation).
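A sketch of the constant-size-state idea, under a loud simplifying assumption: the softmax is dropped entirely, so the output for token t is just the sum over s <= t of (q_t . k_s) * v_s. This is the most basic form of linear attention, not the method of any particular model, and the function names are made up. Real variants add feature maps, decay, or gating on top. With that simplification the pairwise loop and the recurrent state update give the same numbers, but the recurrent version does a fixed amount of work per token.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_unnormalized(qs, ks, vs):
    """Causal attention without softmax, computed pair by pair (O(N^2) pairs)."""
    outs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):
            w = dot(q, ks[s])
            for j, v in enumerate(vs[s]):
                out[j] += w * v
        outs.append(out)
    return outs

def linear_recurrent(qs, ks, vs):
    """Same outputs via a fixed-size state (d_k x d_v), updated once per token."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outs = []
    for q, k, v in zip(qs, ks, vs):
        # Fold the new token into the state: state += outer(k, v).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out with the query: out = q^T state.
        outs.append([sum(q[i] * state[i][j] for i in range(d_k))
                     for j in range(d_v)])
    return outs
```

The state never grows with the sequence, which is why everything the model remembers has to be squeezed into those d_k x d_v numbers. Once you want softmax-style sharp lookups over a long history, that fixed budget is where the lossiness shows up.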

Linear attentions kind of suck in comparison to quadratic attention but the efficiency is very attractive, especially at inference time where you don't need more VRAM to store more context.

TL;DR: conventional attentions scale as N^2 in time and N in space (KV cache), and are exact. Linear attentions scale as N in time and constant in space (recurrent state), and are lossy.
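A back-of-the-envelope sketch of the space half of that summary. The layer count and widths below are hypothetical, the helper names are invented, and the model is deliberately crude (one key and one value vector of width d_model per token per layer, no multi-head or cache-sharing tricks).

```python
def kv_cache_floats(n_tokens, n_layers, d_model):
    """Keys and values kept for every past token in every layer: grows with N."""
    return 2 * n_tokens * n_layers * d_model

def recurrent_state_floats(n_layers, d_k, d_v):
    """One fixed d_k x d_v state per layer: independent of sequence length."""
    return n_layers * d_k * d_v

for n in (1000, 2000, 4000):
    print(n, kv_cache_floats(n, n_layers=4, d_model=64),
          recurrent_state_floats(n_layers=4, d_k=64, d_v=64))
```

The first number doubles every time the context doubles; the second stays put no matter how long the conversation gets.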