So, we're proposing a multiplicative increase of something that already scales quadratically with the context size?
I think we're already bottlenecked on memory bandwidth as it is.
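For scale, a quick back-of-the-envelope (my numbers, not from the article) on just the n×n score map per head in fp16. FlashAttention-style kernels avoid ever materializing it, but it's still a decent proxy for why the quadratic term hurts:

    # Size of the full n x n attention score map, per head, in fp16.
    def score_bytes(n: int, bytes_per_elt: int = 2) -> int:
        return n * n * bytes_per_elt

    for n in (4_096, 32_768, 131_072):
        print(f"{n:>7} tokens: {score_bytes(n) / 2**30:6.2f} GiB per head per layer")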
replies(4):
> allowing nearby queries and keys to affect each other's attention weights for more precise attention
If it is only nearby tokens, it is multiplicative by a constant, right? A fixed k×k neighborhood over the n×n attention map costs O(k²·n²), which is still O(n²), not cubic in context length or anything.
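To make the constant-factor point concrete, here's a minimal PyTorch sketch (my own toy version, not the paper's exact formulation): mix each attention score with a fixed k×k neighborhood of nearby query/key positions via a depthwise conv over the score map. `conv_attention` and the kernel size are both illustrative.

    import torch
    import torch.nn.functional as F

    def conv_attention(q, k, v, kernel_size=5):
        # q, k, v: (batch, heads, seq, dim)
        b, h, n, d = q.shape
        scores = q @ k.transpose(-2, -1) / d ** 0.5   # the usual (n x n) map: O(n^2)
        # Depthwise 2D conv mixes each score with its nearby query/key
        # neighbors; cost is O(k^2 * n^2) -- a constant factor, still O(n^2).
        w = torch.randn(h, 1, kernel_size, kernel_size) / kernel_size ** 2
        scores = F.conv2d(scores, w, padding=kernel_size // 2, groups=h)
        return scores.softmax(dim=-1) @ v   # causal masking omitted for brevity

    q = k = v = torch.randn(1, 4, 128, 32)
    out = conv_attention(q, k, v)           # (1, 4, 128, 32)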
DeepSeek got a training performance increase from predicting two tokens at a time (multi-token prediction), though unlike this, the extra prediction head doesn't go into the final model's inference path. They did say it can be reused for speculative decoding to reduce inference costs.
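Roughly how that reuse works, as I understand it (a greedy-decoding sketch with hypothetical `model` and `draft_head` callables, not DeepSeek's actual code): the extra head drafts the token after next, and the next regular forward pass verifies it.

    import torch

    def speculative_step(model, draft_head, ids):
        # `model` and `draft_head` are hypothetical stand-ins: model(ids)
        # returns (1, seq, vocab) logits; draft_head cheaply guesses the
        # token *after* the next one from the same hidden states.
        next_tok = model(ids)[0, -1].argmax()        # token t+1, always kept
        draft = draft_head(ids).argmax()             # cheap draft for t+2
        ids = torch.cat([ids, next_tok.view(1, 1), draft.view(1, 1)], dim=1)
        # Verify: one forward pass over the sequence ending at t+1; its last
        # logits say what the full model wants at t+2. In a real system this
        # pass doubles as the next step's forward pass -- that's the saving.
        check = model(ids[:, :-1])[0, -1].argmax()
        if check != draft:
            ids[0, -1] = check                       # reject draft, keep model's pick
        return ids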
They may get away with fewer attention heads with this new approach too.