On a related note, one thing I still do not understand is why positional encodings are 'added' to the token embeddings, as opposed to having a smaller position-encoding vector that is 'concatenated'. It would be great if someone could explain.
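To make the distinction concrete, here is a rough sketch of the two alternatives I mean (PyTorch-style; the dimensions and names are just illustrative, not from any particular model):

```python
import torch
import torch.nn as nn

d_model, d_pos, vocab, max_len = 512, 64, 32000, 1024

tok_emb = nn.Embedding(vocab, d_model)        # token embeddings
pos_emb_add = nn.Embedding(max_len, d_model)  # positional table, same width as tokens
pos_emb_cat = nn.Embedding(max_len, d_pos)    # smaller positional vector

tokens = torch.randint(0, vocab, (2, 10))                               # (batch, seq_len)
positions = torch.arange(tokens.size(1)).unsqueeze(0).expand_as(tokens) # (batch, seq_len)

# What transformers actually do: element-wise addition, output stays d_model wide.
x_added = tok_emb(tokens) + pos_emb_add(positions)                       # (2, 10, 512)

# The alternative I'm asking about: concatenate a small positional vector,
# giving d_model + d_pos features per token instead.
x_concat = torch.cat([tok_emb(tokens), pos_emb_cat(positions)], dim=-1)  # (2, 10, 576)
```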
replies(1):