
152 points by fzliu | 1 comment
bigdict No.43562732
Sure, you can get better model performance by throwing more compute at the problem in different places. But does it improve performance on an iso-FLOP basis?
replies(4): >>43562773 >>43563245 >>43563544 >>43564050
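
For what it's worth, "iso-FLOP" here just means holding the compute budget fixed and asking whether a different allocation of it (deeper vs. wider, more conv vs. more attention) buys better quality, rather than simply spending more. A minimal sketch of that bookkeeping, using the usual rough transformer FLOP estimate and made-up configs (nothing here is from the thread or from Evo 2):

    # Rough forward-pass FLOPs/token for a decoder-only transformer stack
    # (standard back-of-the-envelope estimate; the configs below are hypothetical).

    def approx_forward_flops_per_token(n_layers: int, d_model: int, ctx_len: int) -> float:
        dense = 12 * d_model * d_model   # QKV, output proj, and 4x-MLP matmuls per layer
        attn = 2 * ctx_len * d_model     # attention score + value-mixing terms per layer
        return 2 * n_layers * (dense + attn)  # x2 for multiply-accumulate

    deeper = approx_forward_flops_per_token(n_layers=24, d_model=2048, ctx_len=8192)
    wider  = approx_forward_flops_per_token(n_layers=12, d_model=3072, ctx_len=8192)
    print(f"deeper: {deeper:.2e} FLOPs/token, wider: {wider:.2e} FLOPs/token")

An architecture change only counts as an iso-FLOP win if quality improves while the per-token budget stays roughly matched, as with the two variants above.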
1. jwilber No.43562773
There’s no one-size-fits-all answer here, but in my experience, for long contexts, conv-based methods outperform strictly attention-based methods. See Evo 2:

“With the current implementation of Evo2, we do not have the heavily optimized kernels in place for convolution operators like we do for attention layers in a model like llama2. Even with this shortcoming, we see that the benefit from including more convolutional layers makes up for the earlier stage of optimization at around the 64k context length. Beyond that point we see an improvement in performance even compared to a highly optimized transformer model.”

https://docs.nvidia.com/bionemo-framework/latest/models/evo2...
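
To make the crossover they describe concrete: per layer, self-attention cost grows roughly quadratically in context length while a short 1D convolution grows linearly, so past some length the conv-heavy stack wins on FLOPs even before kernel tuning. A rough, hypothetical sketch (dense Conv1d accounting with a made-up d_model and kernel size, not Evo 2's actual operator mix):

    # Rough per-layer FLOP counts vs. context length L.

    def attention_layer_flops(seq_len: int, d_model: int) -> float:
        # Q/K/V/output projections plus the two L x L matmuls.
        proj = 4 * 2 * seq_len * d_model * d_model
        scores = 2 * seq_len * seq_len * d_model      # Q @ K^T
        mix = 2 * seq_len * seq_len * d_model         # softmax(scores) @ V
        return proj + scores + mix

    def conv_layer_flops(seq_len: int, d_model: int, kernel_size: int = 7) -> float:
        # One dense 1D convolution over the sequence.
        return 2 * seq_len * d_model * d_model * kernel_size

    d = 4096
    for L in (8_192, 32_768, 65_536, 262_144, 1_048_576):
        ratio = attention_layer_flops(L, d) / conv_layer_flops(L, d)
        print(f"L={L:>9,}  attention/conv FLOP ratio ~ {ratio:.1f}x")

Where the crossover lands in practice depends heavily on kernel quality and the attention/conv mix, which is presumably why the quoted note only sees the win beyond roughly 64k context despite the quadratic term kicking in earlier.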