Achieved by “applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention”
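For intuition, here's a minimal PyTorch sketch of that key-query idea: convolve the pre-softmax attention logits with a small 2D kernel so each weight is influenced by its neighboring (query, key) pairs. This is only a sketch of the general technique, not the paper's exact method; the `kernel` shape is an assumption, and causal masking plus the separate head-mixing convolution are omitted.

```python
import torch
import torch.nn.functional as F

def conv_attention(q, k, v, kernel):
    """Attention with a convolution over the pre-softmax logits.

    q, k, v: (batch, heads, seq, dim)
    kernel:  (heads, 1, k_q, k_k) -- hypothetical per-head conv kernel
    """
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale  # (batch, heads, seq_q, seq_k)
    # Mix each logit with its neighbors along both the query and key axes;
    # groups=heads gives every head its own kernel (causal masking omitted).
    scores = F.conv2d(scores, kernel, padding="same", groups=scores.shape[1])
    weights = scores.softmax(dim=-1)
    return weights @ v

# toy usage
b, h, s, d = 2, 4, 16, 32
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
kernel = torch.randn(h, 1, 3, 3) * 0.1
out = conv_attention(q, k, v, kernel)  # (2, 4, 16, 32)
```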
Cool to see convolutions making such a comeback lately in the LLM world. See also the recent StripedHyena 2 architecture, which uses the conv-based Hyena operator to great success.