On a related note, one thing I still do not understand is why positional encodings are 'added' to the token embeddings, as opposed to having a smaller position-encoding vector that is 'concatenated'. It would be great if someone could explain.
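To make the distinction concrete, here is a rough sketch of the two alternatives I mean (PyTorch-style; the dimensions and names are just illustrative, not from any particular model):

```python
import torch
import torch.nn as nn

d_model, d_pos, vocab, max_len = 512, 64, 32000, 1024

tok_emb = nn.Embedding(vocab, d_model)        # token embeddings
pos_emb_add = nn.Embedding(max_len, d_model)  # positional table, same width as tokens
pos_emb_cat = nn.Embedding(max_len, d_pos)    # smaller positional vector

tokens = torch.randint(0, vocab, (2, 10))                               # (batch, seq_len)
positions = torch.arange(tokens.size(1)).unsqueeze(0).expand_as(tokens) # (batch, seq_len)

# What transformers actually do: element-wise addition, output stays d_model wide.
x_added = tok_emb(tokens) + pos_emb_add(positions)                       # (2, 10, 512)

# The alternative I'm asking about: concatenate a small positional vector,
# giving d_model + d_pos features per token instead.
x_concat = torch.cat([tok_emb(tokens), pos_emb_cat(positions)], dim=-1)  # (2, 10, 576)
```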
replies(1):