
213 points Philpax | 3 comments
1. 1024core ◴[] No.42175184[source]
I didn't get the sudden leap from "position encodings" to "QKV" magic.

What is the connection between the two? Where does "Q" come from? What are "K" and "V"? (I know they stand for "Query", "Key", "Value"; but what do they have to do with position embeddings?)

replies(2): >>42175364 #>>42176614 #
2. flebron ◴[] No.42175364[source]
All of them are vectors of embedded representations of tokens. In a transformer, you want to compute the inner product between a query (the token doing the attending) and a key (the token being attended to). An inductive bias we have is that the network will perform better if this inner product depends on the relative distance between the query token's position and the key token's position. We therefore encode each one with positional information in such a way that (for RoPE at least) the inner product depends only on the distance between the tokens, not their absolute positions in the input sentence.
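
To make that concrete, here's a minimal NumPy sketch (mine, not the post's code): q and k are just linear projections of token embeddings, and RoPE rotates them by position-dependent angles so that the attention score only sees the offset between positions. The projection matrices and embeddings below are random placeholders, purely to show the shapes.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64  # head dimension

    # In a transformer, q/k/v are learned linear projections of the token embeddings.
    # W_q, W_k and the embeddings x_m, x_n are random stand-ins here.
    W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    x_m, x_n = rng.normal(size=d), rng.normal(size=d)   # embeddings of two tokens
    q, k = W_q @ x_m, W_k @ x_n

    def rope(vec, pos, base=10000.0):
        """Rotate the dimension pairs (i, i + d/2) of `vec` by the angle pos * theta_i."""
        half = vec.shape[-1] // 2
        theta = base ** (-np.arange(half) / half)       # one frequency per dimension pair
        cos, sin = np.cos(pos * theta), np.sin(pos * theta)
        v1, v2 = vec[..., :half], vec[..., half:]
        return np.concatenate([v1 * cos - v2 * sin, v1 * sin + v2 * cos], axis=-1)

    # The attention score uses the rotated vectors. Shifting both positions by the
    # same amount leaves the score unchanged: it depends only on the offset m - n.
    score_a = rope(q, pos=7)   @ rope(k, pos=3)     # offset 4
    score_b = rope(q, pos=107) @ rope(k, pos=103)   # offset 4, both shifted by 100
    print(np.allclose(score_a, score_b))            # True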
3. FL33TW00D ◴[] No.42176614[source]
"This post intends to limit the mathematical knowledge required to follow along, but some basic linear algebra, trigonometry and understanding of self attention is expected."

If you're not sure about self-attention, the post will be a little unclear.