
152 points by fzliu | 1 comment
cgearhart No.43563667
Why is there an expectation that “nearby” tokens are relevant enough to add information to the similarities? That seems like it would hold true within individual words, but the whole point of attention was to solve long-range dependencies. Reintroducing local windows seems like a step backwards to me.
jsenn No.43563870
This doesn’t answer your question, but one thing to keep in mind is that past the very first layer, every “token” position is a weighted average of every previous position, so adjacency in the hidden states isn’t necessarily tied to adjacency of the input tokens.
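
To make that concrete, here is a minimal NumPy sketch of causal self-attention (identity projections and made-up shapes for brevity, not any particular paper's architecture): row i of the output is a convex combination of all input positions up to i, which is why hidden-state neighbors need not correspond to neighboring input tokens.

    # Minimal causal self-attention sketch. After one layer, the vector at
    # position i mixes positions 0..i, so "adjacent" hidden states need not
    # come from adjacent input tokens.
    import numpy as np

    def causal_self_attention(x):          # x: (seq_len, d_model)
        seq_len, d = x.shape
        q, k, v = x, x, x                  # identity projections for brevity
        scores = q @ k.T / np.sqrt(d)      # (seq_len, seq_len) similarities
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores[future] = -np.inf           # forbid attending to later positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ v                 # row i averages every position <= i

    x = np.random.randn(8, 16)
    out = causal_self_attention(x)
    # out[5] is a convex combination of x[0..5], not just of x[4] and x[6].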

A borderline tautological answer might be: “because the network learns that putting related things next to each other increases the usefulness of the convolutions.”
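
For illustration, a hedged sketch of what “convolutions over nearby similarities” could look like: a small kernel slid over the raw query-key score matrix before the softmax, so each final weight can draw on its neighbors' similarities. The uniform 3x3 kernel and the SciPy helper are assumptions made for this example, not the mechanism from the paper under discussion.

    # Sketch: smooth the (query, key) score matrix with a small 2-D kernel
    # so each entry pools the raw similarities of nearby positions. A real
    # model would learn the kernel weights; uniform weights are used here
    # only to keep the example self-contained.
    import numpy as np
    from scipy.signal import convolve2d

    def smooth_scores(scores, kernel_size=3):
        kernel = np.full((kernel_size, kernel_size), 1.0 / kernel_size**2)
        return convolve2d(scores, kernel, mode="same", boundary="fill")

    scores = np.random.randn(8, 8)     # raw query-key similarities
    mixed = smooth_scores(scores)      # each entry now pools a 3x3 patch
    # If related tokens really end up adjacent, the pooled similarities carry
    # extra signal; if not, the convolution mostly averages in noise -- which
    # is the "tautological" point above.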