You could have designed state of the art positional encoding

1. valine ◴[18 Nov 24 01:45 UTC] No.42169009[source]▶

One of the things I really love about rope is that it allows for a lot of interesting encoding schemes at inference time without model retraining. I’ve had a lot of fun playing with different relative positions. You can elicit a lot of interesting behaviors from the model when you use different rotations for keys vs queries, they don’t always have to match.

For example exact position doesn’t matter too much when tokens are spaced out. Let’s say you use token position 100 for your query, you can shift all the keys around position 100, and the further they are back in the context the more freedom you have to play with the value.

replies(2): >>42171938 #>>42175164 #

2. bhickey ◴[18 Nov 24 12:58 UTC] No.42171938[source]▶

>>42169009 (TP) #

Can you describe the behaviors that you can elicit with this technique?

replies(1): >>42175372 #

3. zackangelo ◴[18 Nov 24 18:13 UTC] No.42175164[source]▶

>>42169009 (TP) #

I'm surprised this is the case! I've been working on a rope implementation for my own project (needed to account for padding in unique situations) and even an off by one error usually causes the model to produce non-sensical output.

replies(1): >>42175269 #

4. valine ◴[18 Nov 24 18:21 UTC] No.42175269[source]▶

>>42175164 #

You have to be careful to keep the relative positions for adjacent and nearby tokens intact. The relative positions of distant tokens are less brittle.

5. valine ◴[18 Nov 24 18:29 UTC] No.42175372[source]▶

>>42171938 #

One strategy I’ve been playing around with is to take an instruction I want the model to follow and squish the positional encodings for the keys down to position zero, and the new queries out slightly further in the window. The model will still follow the instruction but the behaviors are more global. It’s behaves more like a fine-tune and less like the instruction is part of the conversation.

replies(1): >>42197544 #

6. bhickey ◴[20 Nov 24 20:01 UTC] No.42197544{3}[source]▶

>>42175372 #

> squish the positional encodings for the keys down to position zero

I might be misunderstanding, but wouldn't this turn your instructions into a bag of words?

replies(1): >>42206207 #

7. valine ◴[21 Nov 24 16:53 UTC] No.42206207{4}[source]▶

>>42197544 #

No, and that’s because we are talking about relative positions. Every query can have its own set of keys. From the perspective of token 100 token 3 would be squished down, but from the perspective of token 3 it is still at position 3 and can see tokens 0,1,2 without them being squished.