Why is there an expectation that “nearby” tokens are relevant to increase the information in the similarities? That seems like it would hold true within individual words, but the whole point of attention was to solve long range dependencies. Reintroducing local windows seems like a step backwards to me.
replies(3):