213 points | Philpax | 3 comments
1. alok-g | No.42178008
On a related note, one thing I still do not understand is why positional encodings are 'added' to the token embeddings, as opposed to 'concatenating' a smaller positional encoding vector. It would be great if someone could explain.
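
For concreteness, a minimal sketch of the two alternatives (NumPy, arbitrary toy sizes; the sinusoidal encoding is just the standard choice from the original Transformer paper, not the only option):

    import numpy as np

    def sinusoidal_encoding(seq_len, dim):
        # Standard sinusoidal positional encoding (Vaswani et al., 2017).
        pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
        i = np.arange(dim // 2)[None, :]             # (1, dim/2)
        angles = pos / (10000 ** (2 * i / dim))      # (seq_len, dim/2)
        enc = np.zeros((seq_len, dim))
        enc[:, 0::2] = np.sin(angles)
        enc[:, 1::2] = np.cos(angles)
        return enc

    seq_len, d_model, d_pos = 128, 512, 64           # toy sizes, chosen arbitrarily
    tok = np.random.randn(seq_len, d_model)          # stand-in for token embeddings

    # What the Transformer actually does: add a d_model-sized encoding elementwise.
    x_added = tok + sinusoidal_encoding(seq_len, d_model)       # stays (128, 512)

    # The alternative being asked about: concatenate a smaller encoding.
    x_concat = np.concatenate(
        [tok, sinusoidal_encoding(seq_len, d_pos)], axis=-1)    # grows to (128, 576)
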
replies(1): >>42178343 #
2. d3m0t3p | No.42178343
Increasing the dimension causes a lot more computation; that's one of the main reasons. You can see evidence of this in multi-head attention, where the dimension is reduced per head via a linear projection:

h_i = softmax((Q @ W_i^Q) @ (K @ W_i^K)^T / sqrt(d_k)) @ (V @ W_i^V)

h = concat(h_1, ..., h_8) @ W^O
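
A rough back-of-the-envelope illustration of that cost, counting only the Q/K/V/output projection parameters of a single attention layer (d_model = 512 and d_pos = 64 are arbitrary toy sizes for the hypothetical concatenated variant):

    def attn_proj_params(width):
        # W^Q, W^K, W^V and W^O are each (width x width) in multi-head attention.
        return 4 * width * width

    d_model, d_pos = 512, 64
    added  = attn_proj_params(d_model)           # 4 * 512^2 = 1,048,576
    concat = attn_proj_params(d_model + d_pos)   # 4 * 576^2 = 1,327,104
    print(f"extra parameters per attention layer: {concat - added:,} "
          f"({concat / added - 1:.0%} more)")    # ~27% more

The same widening would hit every other d_model-sized matrix in the model (MLP blocks, embeddings) in every layer, so the overhead compounds.
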

replies(1): >>42180788 #
3. bfelbo | No.42180788
How many extra dimensions would you need to capture positional information?

Seems to me like it'd be quite a low number compared to the dimensionality of the semantic vectors?
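
As a rough sanity check of that intuition: even a very small sinusoidal encoding keeps every position distinct (16 dimensions and 512 positions are arbitrary choices here, versus token embeddings that are typically several hundred dimensions wide):

    import numpy as np

    def sinusoidal_encoding(seq_len, dim):
        # Same standard sinusoidal encoding as in the first sketch above.
        pos = np.arange(seq_len)[:, None]
        i = np.arange(dim // 2)[None, :]
        angles = pos / (10000 ** (2 * i / dim))
        enc = np.zeros((seq_len, dim))
        enc[:, 0::2] = np.sin(angles)
        enc[:, 1::2] = np.cos(angles)
        return enc

    enc = sinusoidal_encoding(seq_len=512, dim=16)
    diffs = enc[:, None, :] - enc[None, :, :]        # (512, 512, 16) pairwise differences
    dists = np.sqrt((diffs ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)                  # ignore self-distances
    print("min distance between any two positions:", dists.min())
    # A nonzero minimum means all 512 positions map to distinct codes
    # with only 16 dimensions.
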