152 points fzliu | 1 comments
cgearhart No.43563667
Why is there an expectation that “nearby” tokens are relevant for increasing the information in the similarities? That seems like it would hold true within individual words, but the whole point of attention was to solve long-range dependencies. Reintroducing local windows seems like a step backwards to me.
replies(3): >>43563701 #>>43563870 #>>43564611 #
1. energy123 No.43564611
It's a little more inductive bias, and that's not necessarily a step backwards. You need the right amount of inductive bias for a given data size and model capacity, no more and no less. Transformers already build in the inductive bias of temporal locality by being causal.
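
To make "inductive bias" concrete, here is a rough sketch of how a mask hard-codes an assumption into attention. It is not the method from the linked work, just an illustration: `causal_mask`, `local_causal_mask`, `masked_softmax`, and the `window` parameter are names made up for this example, and a hard local window is only one (fairly blunt) way to favor nearby tokens.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when query i may attend to key j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def local_causal_mask(n, window):
    """Causal mask further restricted to the most recent `window` positions.

    `window` is a hypothetical knob for this sketch; assumes window >= 1.
    """
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over allowed positions; masked positions get weight 0."""
    out = []
    for row, allowed in zip(scores, mask):
        # Subtract the max allowed score for numerical stability.
        m = max(s for s, a in zip(row, allowed) if a)
        exps = [math.exp(s - m) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

n = 5
# Uniform scores, so any structure in the weights comes from the mask alone.
scores = [[0.0] * n for _ in range(n)]

full = masked_softmax(scores, causal_mask(n))
local = masked_softmax(scores, local_causal_mask(n, window=2))

print(full[4])   # last token spreads weight over all earlier positions
print(local[4])  # last token only sees itself and its immediate predecessor
```

With identical scores, the causal mask alone already rules out half of the attention matrix, and the local window rules out more. Whether that extra restriction helps or hurts is the data-size versus capacity trade-off described above.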