Sure, you can get better model performance by throwing more compute at the problem in different places. But does it improve perf on an iso-FLOP basis?
"With the current implementation of Evo2, we do not have the heavily optimized kernels in place for convolution operators like we do for attention layers in a model like llama2. Even with this shortcoming, we see that the benefit from including more convolutional layers makes up for the earlier stage of optimization at around the 64k context length. Beyond that point we see an improvement in performance even compared to a highly optimized transformer model."
https://docs.nvidia.com/bionemo-framework/latest/models/evo2...
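For intuition on why longer contexts favor the convolutional operator even without optimized kernels, here's a back-of-the-envelope per-layer FLOP comparison of self-attention vs. an FFT-based long convolution. The scaling laws (quadratic vs. roughly L log L in sequence length) are standard approximations; the hidden width and constant factors are illustrative assumptions, not Evo2's actual layer accounting.

```python
# Rough iso-FLOP intuition: per-layer FLOP estimates for self-attention vs.
# an FFT-based long convolution, as a function of sequence length L.
# All constants here are illustrative assumptions, not Evo2's real numbers.

import math

d_model = 4096          # assumed hidden width
seq_lengths = [8_192, 32_768, 65_536, 131_072, 262_144]

def attention_flops(L, d):
    # QK^T and attention*V each cost ~L^2 * d multiply-adds,
    # plus Q/K/V/output projections at ~4 * L * d^2.
    return 2 * 2 * L * L * d + 2 * 4 * L * d * d

def long_conv_flops(L, d):
    # FFT-based convolution over the full sequence: ~L * log2(L) work per
    # channel for the transforms, plus in/out projections at ~2 * L * d^2.
    return 2 * 10 * d * L * math.log2(L) + 2 * 2 * L * d * d

print(f"{'L':>9} {'attention':>14} {'long conv':>14} {'ratio':>7}")
for L in seq_lengths:
    a, c = attention_flops(L, d_model), long_conv_flops(L, d_model)
    print(f"{L:>9,} {a:>14.3e} {c:>14.3e} {a / c:>7.1f}x")
```

With these assumptions the theoretical FLOP gap widens from a few x at 8k tokens to well over 10x at 64k and beyond, which is consistent with the docs' claim that the architectural advantage starts to outweigh the less mature convolution kernels around that context length.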