
213 points by Philpax | 3 comments
valine (No.42169009)
One of the things I really love about RoPE is that it allows for a lot of interesting encoding schemes at inference time without retraining the model. I’ve had a lot of fun playing with different relative positions. You can elicit a lot of interesting behaviors from the model when you use different rotations for keys vs. queries; they don’t always have to match.

For example, exact position doesn’t matter too much when tokens are spaced out. Say you use token position 100 for your query: you can shift all the keys around position 100, and the further back a key is in the context, the more freedom you have to play with its exact position.
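A minimal sketch of what this allows, in PyTorch (my illustration, not code from the thread; the rope() helper, the shapes, and the specific key positions are assumptions): because the rotation is a pure function of the position index, queries and keys can simply be rotated with different position vectors.

    import torch

    def rope(x, positions, base=10000.0):
        # Standard rotate-half RoPE applied to x of shape (seq, heads, head_dim),
        # using an explicit per-token position vector instead of torch.arange.
        d = x.shape[-1]
        inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
        angles = positions[:, None].float() * inv_freq[None, :]           # (seq, d/2)
        cos = torch.cos(angles).repeat_interleave(2, dim=-1)[:, None, :]
        sin = torch.sin(angles).repeat_interleave(2, dim=-1)[:, None, :]
        x1, x2 = x[..., 0::2], x[..., 1::2]
        rotated = torch.stack((-x2, x1), dim=-1).flatten(-2)              # rotate-half
        return x * cos + rotated * sin

    seq, heads, dim = 8, 4, 64
    q, k = torch.randn(seq, heads, dim), torch.randn(seq, heads, dim)

    # Usual setup: queries and keys share one set of positions.
    pos = torch.arange(seq)
    q_rot, k_rot = rope(q, pos), rope(k, pos)

    # The trick: the rotations don't have to match. Put the query at position 100
    # and shift the keys to arbitrary positions near it.
    q_rot = rope(q, torch.full((seq,), 100))
    k_rot = rope(k, torch.tensor([88, 91, 94, 96, 97, 98, 99, 100]))

The reason this works at all is that the dot product between a query rotated to position m and a key rotated to position n depends only on the offset m − n, so only the relative positions you choose end up mattering to attention.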

bhickey (No.42171938)
Can you describe the behaviors that you can elicit with this technique?
1. valine (No.42175372)
One strategy I’ve been playing around with is to take an instruction I want the model to follow, squish the positional encodings for its keys down to position zero, and place the new queries slightly further out in the window. The model will still follow the instruction, but the behavior is more global: it behaves more like a fine-tune and less like the instruction is part of the conversation.
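A hedged sketch of how this might be wired up, reusing the rope() helper from the sketch above (the split into n_instr instruction tokens and n_conv conversation tokens, and all sizes, are assumptions for illustration):

    import torch  # assumes the rope() helper defined in the earlier sketch

    # Hypothetical setup: the first n_instr tokens are the instruction. Their
    # *key* rotations are all collapsed to position 0, while the queries keep
    # normal positions further out in the window.
    n_instr, n_conv, heads, dim = 16, 48, 4, 64
    q = torch.randn(n_instr + n_conv, heads, dim)
    k = torch.randn(n_instr + n_conv, heads, dim)

    q_pos = torch.arange(n_instr + n_conv)            # queries: unchanged positions
    k_pos = torch.cat([
        torch.zeros(n_instr, dtype=torch.long),       # instruction keys squished to 0
        torch.arange(n_instr, n_instr + n_conv),      # conversation keys unchanged
    ])

    q_rot, k_rot = rope(q, q_pos), rope(k, k_pos)
    # attention then proceeds as usual on q_rot and k_rot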
2. bhickey (No.42197544)
> squish the positional encodings for the keys down to position zero

I might be misunderstanding, but wouldn't this turn your instructions into a bag of words?

3. valine (No.42206207)
No, and that’s because we are talking about relative positions: every query can have its own set of keys. From the perspective of token 100, token 3 would be squished down, but from the perspective of token 3 it is still at position 3 and can see tokens 0, 1, and 2 without them being squished.
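One way to picture "every query has its own set of keys" is to recompute the key rotations per query row. The sketch below is illustrative only (the window=20 squishing policy and the per-row loop are my assumptions, and a real implementation would vectorize and cache this); it again reuses the rope() helper from the first sketch.

    import torch  # assumes the rope() helper defined in the earlier sketch

    def key_positions_for(query_idx, seq_len, window=20):
        # Made-up policy for illustration: keys more than `window` tokens behind
        # the query get squished down to position 0; nearby keys keep their
        # true positions.
        pos = torch.arange(seq_len)
        far = (query_idx - pos) > window
        return torch.where(far, torch.zeros_like(pos), pos)

    seq, heads, dim = 101, 4, 64
    q, k = torch.randn(seq, heads, dim), torch.randn(seq, heads, dim)

    scores = torch.empty(seq, heads, seq)
    for i in range(seq):                                  # brute force for clarity
        q_rot = rope(q[i:i+1], torch.tensor([i]))         # query at its true position
        k_rot = rope(k, key_positions_for(i, seq))        # keys re-rotated per query
        scores[i] = torch.einsum('qhd,khd->hk', q_rot, k_rot) / dim ** 0.5

    # From token 100's perspective, token 3 is squished down to position 0, but
    # from token 3's perspective tokens 0, 1, and 2 keep their normal positions.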