213 points Philpax | 4 comments
throwawaymaths ◴[] No.42169348[source]
Maybe someone could answer this for me: it seems like encoding the positional embeddings as augmentations to the "natural" activations, instead of as their own inputs concatenated onto the activations, makes things like sliding a window much harder... I guess the obvious drawback of concatenating is that you'd have somewhat less room for textually derived information.

I recall an early transformers video where they tried both, and it turned out that adding the position onto the existing vectors was no worse, so they went with it... No further discussion of the motivation happened in that video.

Is it maybe worth revisiting that, now that activations have a gobsmackingly large dimension?

replies(1): >>42170034 #
1. stephantul ◴[] No.42170034[source]
They are not concatenated, but summed. I think concatenation wouldn’t work, as you indicate.

I think you mean the line in the original paper where they say they compared learned positional embeddings with the predefined sinusoidal encoding, and it made no difference.
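
(For concreteness, here is a minimal PyTorch sketch of that summed scheme — my own illustration, not anything from the paper or the thread; the dimensions are arbitrary.)

    import torch

    def sinusoidal_encoding(seq_len, d_model):
        # Fixed sinusoidal positional encoding in the style of "Attention Is All You Need".
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
        inv_freq = torch.exp(
            -torch.arange(0, d_model, 2, dtype=torch.float32)
            * (torch.log(torch.tensor(10000.0)) / d_model)
        )                                                                    # (d_model/2,)
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * inv_freq)
        pe[:, 1::2] = torch.cos(pos * inv_freq)
        return pe

    d_model, seq_len, vocab = 512, 128, 32000
    emb = torch.nn.Embedding(vocab, d_model)
    tokens = torch.randint(0, vocab, (seq_len,))
    # Summed, not concatenated: the result keeps shape (seq_len, d_model).
    x = emb(tokens) + sinusoidal_encoding(seq_len, d_model)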

replies(1): >>42170614 #
2. throwawaymaths ◴[] No.42170614[source]
> I think concatenation wouldn’t work, as you indicate.

Why do you say that?

replies(1): >>42171738 #
3. donkeyboy ◴[] No.42171738[source]
Concat could work too, although it would be less efficient because you need to make a new tensor.

Actually, summing might learn a concat on its own. Imagine the embedding learned for a token takes up the first N-20 dimensions and leaves the last 20 dimensions as 0, while the positional encoding zeroes the first N-20 dims and encodes its information in the last 20. Then when you sum, you are actually concatenating. So I think of them as equivalent, except that add is more efficient and preserves the dim space, while concat would grow the dim space. And for something like position, which certainly does not need to occupy 1000+ dimensions, it would not make sense to concat all of that; it would be wasteful.
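
(A toy check of that disjoint-subspace picture — my own PyTorch sketch, not from the comment; N and the 20-dim split are just the numbers used above.)

    import torch

    N, pos_dims = 64, 20

    tok = torch.zeros(N)
    tok[: N - pos_dims] = torch.randn(N - pos_dims)   # token content lives in the first N-20 dims

    pos = torch.zeros(N)
    pos[N - pos_dims:] = torch.randn(pos_dims)        # position lives in the last 20 dims

    summed = tok + pos
    concatenated = torch.cat([tok[: N - pos_dims], pos[N - pos_dims:]])

    # With disjoint supports, the sum *is* the concat.
    assert torch.equal(summed, concatenated)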

replies(1): >>42175913 #
4. throwawaymaths ◴[] No.42175913{3}[source]
Why would you need to make a new tensor?

Suppose you had 4096-dimensional activations (Llama 2 sized). Maybe you make do with 3072 content activations and concatenate 1024 positional activations onto that.

Then you pass that to Mk Mq Mv and generate K, Q, V.

The only thing that would change would be Mff-out, which would now be a (big)x3072 matrix instead of (big)x4096.

In any case you would be retraining, so changing the dims of the tensors is, I think, not a big deal... In fact, in this case they would be smaller (at the cost of fewer interlayer activations), but you would have the same number of tensors.
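
Concretely, something like this rough single-head PyTorch sketch (the Wq/Wk/Wv/Wo names and the exact 3072/1024 split are just placeholders, nothing tied to the real Llama 2 code; the same output-width change would apply to Mff-out):

    import torch
    import torch.nn as nn

    d_content, d_pos = 3072, 1024            # hypothetical split; together they fill the 4096 width
    d_model = d_content + d_pos

    class ConcatPosAttention(nn.Module):
        # Single-head attention where position is concatenated onto the content
        # activations rather than summed into them (sketch only).
        def __init__(self):
            super().__init__()
            self.Wq = nn.Linear(d_model, d_model, bias=False)    # Mq sees content + position
            self.Wk = nn.Linear(d_model, d_model, bias=False)    # Mk sees content + position
            self.Wv = nn.Linear(d_model, d_model, bias=False)    # Mv sees content + position
            self.Wo = nn.Linear(d_model, d_content, bias=False)  # output projection emits content dims only

        def forward(self, content, pos_enc):
            x = torch.cat([content, pos_enc], dim=-1)                              # (seq, 4096)
            q, k, v = self.Wq(x), self.Wk(x), self.Wv(x)
            attn = torch.softmax(q @ k.transpose(-2, -1) / d_model ** 0.5, dim=-1)
            return self.Wo(attn @ v)                                               # (seq, 3072)

    seq_len = 16
    content = torch.randn(seq_len, d_content)
    pos_enc = torch.randn(seq_len, d_pos)     # stand-in for whatever positional scheme gets concatenated
    print(ConcatPosAttention()(content, pos_enc).shape)   # torch.Size([16, 3072])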

> Actually summing might learn a concat on its own.

But you see the point? You're forcing the model to learn something that maybe it didn't need to. That's like saying "well, a fully connected network might learn convolution on its own". Historically, breakthroughs in capability have accompanied one of: more data, more layers, or smarter constraints on activations.

Unless you have some sort of argument that forcing it to learn position has carryover value in generating activations, it seems, naively, a bad idea.