Achieved by “applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention”
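For intuition, here's a minimal PyTorch sketch of that key-query idea: convolve the pre-softmax attention logits with a small 2D kernel so each weight is influenced by its neighboring (query, key) pairs. This is only a sketch of the general technique, not the paper's exact method; the `kernel` shape is an assumption, and causal masking plus the separate head-mixing convolution are omitted.

```python
import torch
import torch.nn.functional as F

def conv_attention(q, k, v, kernel):
    """Attention with a convolution over the pre-softmax logits.

    q, k, v: (batch, heads, seq, dim)
    kernel:  (heads, 1, k_q, k_k) -- hypothetical per-head conv kernel
    """
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale  # (batch, heads, seq_q, seq_k)
    # Mix each logit with its neighbors along both the query and key axes;
    # groups=heads gives every head its own kernel (causal masking omitted).
    scores = F.conv2d(scores, kernel, padding="same", groups=scores.shape[1])
    weights = scores.softmax(dim=-1)
    return weights @ v

# toy usage
b, h, s, d = 2, 4, 16, 32
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
kernel = torch.randn(h, 1, 3, 3) * 0.1
out = conv_attention(q, k, v, kernel)  # (2, 4, 16, 32)
```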
Cool to see convolutions making such a comeback lately in the LLM world. See also the recent StripedHyena 2 architecture, which uses the conv-based Hyena operator to great success.