
152 points by fzliu | 1 comment
bob1029 ◴[] No.43562889[source]

So, we're proposing a multiplicative increase of something that already scales quadratically with the context size?

I think we've already got a bit of a bottleneck in terms of memory bandwidth utilization.
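For scale, a quick back-of-envelope with made-up kernel sizes, just to see what a constant multiplier on a quadratic term looks like:

```python
# Back-of-envelope: vanilla attention score cost vs. a conv over the scores.
# Kernel sizes c_q, c_k are made up; the point is the constant multiplier.
n, d = 8192, 128                    # context length, head dim
c_q, c_k = 6, 11                    # hypothetical conv kernel (queries x keys)

qk_flops   = 2 * n * n * d          # QK^T matmul: the quadratic term
conv_flops = n * n * c_q * c_k      # conv over the n x n score matrix

print(f"QK^T : {qk_flops:.2e} FLOPs")
print(f"conv : {conv_flops:.2e} FLOPs (+{conv_flops / qk_flops:.1%})")
```

(This only counts the QK^T matmul, not the AV matmul or the memory traffic, which is the part I'd actually worry about.)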

replies(4): >>43563169 #>>43563334 #>>43563390 #>>43563970 #
1. cma ◴[] No.43563334[source]

> allowing nearby queries and keys to affect each other's attention weights for more precise attention

If it's only nearby tokens, it's multiplicative by a constant, right? It wouldn't make it cubic in context length or anything.
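Something like this is how I picture it (a minimal sketch; the kernel sizes and where the conv sits relative to the softmax are my assumptions, not necessarily the paper's): each mixed logit only sees a fixed neighborhood of the score matrix, so the extra cost is n^2 times a constant.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: convolve attention logits so nearby queries/keys interact.
# Kernel sizes and conv placement are illustrative assumptions.
n, d, c_q, c_k = 256, 64, 3, 3

q, k, v = (torch.randn(n, d) for _ in range(3))
logits = (q @ k.T) / d**0.5                        # (n, n): the usual O(n^2)

# 2D conv over the (query, key) plane: each mixed logit sees a fixed
# c_q x c_k neighborhood, so cost is n^2 * c_q * c_k -- still O(n^2).
kernel = torch.randn(1, 1, c_q, c_k) / (c_q * c_k)
mixed = F.conv2d(logits[None, None], kernel, padding="same")[0, 0]

out = F.softmax(mixed, dim=-1) @ v                 # (n, d); causal mask omitted
```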

DeepSeek got a training performance increase by predicting two tokens at a time, though that module doesn't feed into final-model inference the way this would. They did say it can be reused for speculative decoding to reduce inference costs, though.
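A toy version of that training setup, as I understand it (DeepSeek's actual MTP module is a full transformer block chained off the trunk, so this is a simplification and all the names here are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy multi-token prediction: two heads predict tokens t+1 and t+2 from the
# same hidden state. A simplification of DeepSeek-V3's MTP; names made up.
class TwoTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab: int):
        super().__init__()
        self.next_head = nn.Linear(d_model, vocab)  # token t+1
        self.skip_head = nn.Linear(d_model, vocab)  # token t+2

    def forward(self, hidden):
        return self.next_head(hidden), self.skip_head(hidden)

head = TwoTokenHead(d_model=64, vocab=1000)
hidden = torch.randn(2, 16, 64)                     # (batch, seq, d_model)
logits1, logits2 = head(hidden)

t1 = torch.randint(0, 1000, (2, 16))                # next-token targets
t2 = torch.randint(0, 1000, (2, 16))                # skip-one targets
loss = (F.cross_entropy(logits1.transpose(1, 2), t1)
        + F.cross_entropy(logits2.transpose(1, 2), t2)) / 2

# At inference the skip head can propose a draft token for speculative
# decoding (verify-then-accept), or be dropped from the model entirely.
```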

They may get away with fewer attention heads with this new approach, too.