imjonse:
I don't think the first code example should work (indeed, it prints false when I run it here).

When given a permuted sequence, the attention output will also be permuted, not identical. Positional encodings are needed because, without them, two identical tokens end up with the same values in the final attention output regardless of their absolute or relative positions; losing that positional information is enough to miss a lot of meaning.
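
A minimal sketch of that check (not the article's code; just PyTorch's nn.MultiheadAttention on random inputs, with no positional encodings added): permuting the input permutes the output rows, so the two outputs only match after re-applying the permutation.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    seq_len, dim = 6, 16
    mha = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

    x = torch.randn(1, seq_len, dim)          # (batch, seq, dim), no positional info
    perm = torch.arange(seq_len - 1, -1, -1)  # reverse the sequence

    with torch.no_grad():
        out, _ = mha(x, x, x)                                   # original order
        out_perm, _ = mha(x[:, perm], x[:, perm], x[:, perm])   # permuted order

    print(torch.allclose(out, out_perm, atol=1e-6))           # False: not identical
    print(torch.allclose(out[:, perm], out_perm, atol=1e-6))  # True: same rows, permuted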

aconz2:
To add on, since this took me a while to understand: for a single token, self-attention is permutation invariant because we take the softmax(qK^T)-weighted sum of all the values (one query dotted with all the keys); that sum is what gives the invariance, because addition is commutative. But across all the tokens, the MHA output matrix is not invariant, rather equivariant: permuting the input tokens applies the same permutation to the rows of the output matrix. What might be a more useful example is to take one position, like the last one, and compute its MHA output for every permutation of the previous tokens; those should all be the same.
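
Here's a hedged sketch of that experiment (again just nn.MultiheadAttention with random weights and no positional encodings, nothing from the article): keep the last token as the sole query, permute the tokens it attends over, and the output for that position comes out the same up to float rounding.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    seq_len, dim = 6, 16
    mha = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

    x = torch.randn(1, seq_len, dim)
    q = x[:, -1:, :]                          # last token as the only query

    # Permute the previous tokens, keep the last one in place
    perm = torch.cat([torch.randperm(seq_len - 1), torch.tensor([seq_len - 1])])

    with torch.no_grad():
        o1, _ = mha(q, x, x)
        o2, _ = mha(q, x[:, perm], x[:, perm])

    # softmax(qK^T)V is a weighted sum over the keys/values, and a sum does not
    # care about the order of its terms, so the single-query output is invariant.
    print(torch.allclose(o1, o2, atol=1e-6))  # True (up to float error)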