So, we're proposing a multiplicative increase of something that already scales quadratically with the context size?
I think we're already bottlenecked on memory bandwidth as it is.
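For scale, a quick back-of-the-envelope (my numbers, not from the article) on just the n×n score map per head in fp16. FlashAttention-style kernels avoid ever materializing it, but it's still a decent proxy for why the quadratic term hurts:

    # Size of the full n x n attention score map, per head, in fp16.
    def score_bytes(n: int, bytes_per_elt: int = 2) -> int:
        return n * n * bytes_per_elt

    for n in (4_096, 32_768, 131_072):
        print(f"{n:>7} tokens: {score_bytes(n) / 2**30:6.2f} GiB per head per layer")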
replies(4):
> allowing nearby queries and keys to affect each other's attention weights for more precise attention
If it is only nearby tokens, it is multiplicative by a constant, right? A fixed k×k neighborhood over the n×n attention map costs O(k²·n²), which is still O(n²), not cubic in context length or anything.
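To make the constant-factor point concrete, here's a minimal PyTorch sketch (my own toy version, not the paper's exact formulation): mix each attention score with a fixed k×k neighborhood of nearby query/key positions via a depthwise conv over the score map. `conv_attention` and the kernel size are both illustrative.

    import torch
    import torch.nn.functional as F

    def conv_attention(q, k, v, kernel_size=5):
        # q, k, v: (batch, heads, seq, dim)
        b, h, n, d = q.shape
        scores = q @ k.transpose(-2, -1) / d ** 0.5   # the usual (n x n) map: O(n^2)
        # Depthwise 2D conv mixes each score with its nearby query/key
        # neighbors; cost is O(k^2 * n^2) -- a constant factor, still O(n^2).
        w = torch.randn(h, 1, kernel_size, kernel_size) / kernel_size ** 2
        scores = F.conv2d(scores, w, padding=kernel_size // 2, groups=h)
        return scores.softmax(dim=-1) @ v   # causal masking omitted for brevity

    q = k = v = torch.randn(1, 4, 128, 32)
    out = conv_attention(q, k, v)           # (1, 4, 128, 32)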
DeepSeek got a training performance increase from predicting two tokens at a time (multi-token prediction), though unlike this, the extra prediction head doesn't go into the final model's inference path. They did say it can be reused for speculative decoding to reduce inference costs.
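Roughly how that reuse works, as I understand it (a greedy-decoding sketch with hypothetical `model` and `draft_head` callables, not DeepSeek's actual code): the extra head drafts the token after next, and the next regular forward pass verifies it.

    import torch

    def speculative_step(model, draft_head, ids):
        # `model` and `draft_head` are hypothetical stand-ins: model(ids)
        # returns (1, seq, vocab) logits; draft_head cheaply guesses the
        # token *after* the next one from the same hidden states.
        next_tok = model(ids)[0, -1].argmax()        # token t+1, always kept
        draft = draft_head(ids).argmax()             # cheap draft for t+2
        ids = torch.cat([ids, next_tok.view(1, 1), draft.view(1, 1)], dim=1)
        # Verify: one forward pass over the sequence ending at t+1; its last
        # logits say what the full model wants at t+2. In a real system this
        # pass doubles as the next step's forward pass -- that's the saving.
        check = model(ids[:, :-1])[0, -1].argmax()
        if check != draft:
            ids[0, -1] = check                       # reject draft, keep model's pick
        return ids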
They may get away with fewer attention heads with this new approach too.