(arxiv.org)

152 points fzliu | 2 comments | 02 Apr 25 22:20 UTC

1. bionhoward [02 Apr 25 23:31 UTC] No.43563017

How does this compare with Byte Latent Transformer [1]? Is the difference that this applies convolution after embedding, while BLT applies attention at embedding time?

1. https://ai.meta.com/research/publications/byte-latent-transf...

replies(1): >>43563056 #

2. janalsncm [02 Apr 25 23:35 UTC] No.43563056

>>43563017 (TP) #

As I understand it, BLT uses a small nn to tokenize but doesn’t change the attention mechanism. MTA uses traditional BPE for tokenization but changes the attention mechanism. You could use both (latency be damned!)
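
To make "changes the attention mechanism" concrete, here is a toy sketch. It assumes the change amounts to mixing attention scores across neighboring query and key positions with a small convolution before the softmax. The function names, the kernel shape, and the lack of causal masking are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(q, k, v):
    # Ordinary single-head scaled dot-product attention (no mask).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def conv_attention(q, k, v, kernel):
    # Hypothetical variant: slide a 2D kernel over the (query, key) score
    # matrix so each score also depends on its neighbors, then softmax.
    # Assumes both kernel dimensions are odd; borders are zero-padded.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    n_q, n_k = scores.shape
    kq, kk = kernel.shape
    padded = np.pad(scores, ((kq // 2, kq // 2), (kk // 2, kk // 2)))
    mixed = np.zeros_like(scores)
    for i in range(kq):
        for j in range(kk):
            mixed += kernel[i, j] * padded[i:i + n_q, j:j + n_k]
    return softmax(mixed) @ v
```

With a kernel that is 1 at the center and 0 elsewhere, `conv_attention` reduces to `standard_attention`, so the extra cost comes only from the mixing step. Tokenization is untouched either way, which is why the two ideas could in principle be combined.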


Multi-Token Attention