
213 points by Philpax | 2 comments
valine | No.42169009
One of the things I really love about RoPE is that it allows for a lot of interesting encoding schemes at inference time without retraining the model. I've had a lot of fun playing with different relative positions. You can elicit a lot of interesting behaviors from the model when you use different rotations for keys vs. queries; they don't always have to match.

For example, exact position doesn't matter too much when tokens are spaced out. Say you use token position 100 for your query: you can shift the keys around position 100, and the further back they are in the context, the more freedom you have to play with their position values.
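
A minimal sketch of the kind of experiment this enables (my own illustration, not valine's actual setup): apply RoPE to the queries at their true positions but to the keys at shifted positions, then compute attention scores as usual. The particular shift used here is arbitrary.

    import torch

    def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        # Apply rotary position embedding to x at the given positions.
        # x: (seq, dim) with dim even; positions: (seq,), may be shifted or fractional.
        dim = x.shape[-1]
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        angles = positions.float()[:, None] * inv_freq[None, :]   # (seq, dim/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        # Rotate each (x1, x2) channel pair by its per-position angle.
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    seq, dim = 16, 64
    q, k = torch.randn(seq, dim), torch.randn(seq, dim)
    pos = torch.arange(seq)

    # Queries use their true positions; keys use shifted positions
    # (arbitrary example: pull every key two positions toward the end).
    q_rot = rope_rotate(q, pos)
    k_rot = rope_rotate(k, torch.clamp(pos + 2, max=seq - 1))
    scores = (q_rot @ k_rot.T) / dim ** 0.5

Because RoPE only enters through these rotations, swapping in different position vectors for keys and queries requires no change to the model weights.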

1. zackangelo | No.42175164
I'm surprised this is the case! I've been working on a RoPE implementation for my own project (I needed to account for padding in some unusual situations), and even an off-by-one error usually causes the model to produce nonsensical output.
2. valine | No.42175269
You have to be careful to keep the relative positions of adjacent and nearby tokens intact. The relative positions of distant tokens are less brittle.
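
For illustration only, here is one hypothetical way to remap key positions so that nearby relative offsets stay exact while distant ones are compressed. The window size and squash factor are arbitrary choices of mine, not anything stated in the thread.

    import torch

    def remap_key_positions(query_pos: int, key_pos: torch.Tensor,
                            exact_window: int = 32, squash: float = 0.5) -> torch.Tensor:
        # Keep exact relative positions within a local window of the query;
        # compress distances beyond the window (illustrative mapping).
        rel = key_pos.float() - query_pos      # negative for tokens earlier in context
        dist = rel.abs()
        squashed = exact_window + (dist - exact_window) * squash
        new_dist = torch.where(dist <= exact_window, dist, squashed)
        return query_pos + torch.sign(rel) * new_dist

    key_pos = torch.arange(101)                # keys at positions 0..100
    new_key_pos = remap_key_positions(100, key_pos)
    # Keys within 32 tokens of the query keep their exact relative positions;
    # older keys are pulled closer, shrinking the long-range offsets.

The remapped positions would then be fed to the key-side rotation while the query keeps its true position.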