How does this compare with the Byte Latent Transformer (BLT) [1]? Is the difference that this uses convolution after the embedding, while BLT uses attention at embedding time?
1. https://ai.meta.com/research/publications/byte-latent-transf...
replies(1):
1. https://ai.meta.com/research/publications/byte-latent-transf...