Concretely: we learn a medium-sized model that takes a partial tokenization and outputs a probability distribution over the endpoints of the next token (say token lengths range from 1 to 64 bytes, so the model outputs 64 logits). Then we do a beam search to find the, say, 4 most likely tokenizations. Then we run the transformer on all four tokenizations and take the expected value of the loss, weighted by each tokenization's probability under the boundary model, as the final loss.
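To make the boundary-predictor / beam-search / expected-loss loop concrete, here's a toy sketch. The `BoundaryPredictor` is a tiny GRU stand-in for the medium-sized model, `transformer_loss_fn` is a placeholder for running the big transformer on one tokenization and returning its loss, and the mixture weights over the beam are treated as constants (I'm not worrying here about how gradients would reach the boundary model):

```python
import torch
import torch.nn.functional as F

MAX_TOKEN_LEN = 64   # token lengths range from 1 to 64 bytes
BEAM_WIDTH = 4       # keep the 4 most likely tokenizations


class BoundaryPredictor(torch.nn.Module):
    """Scores the 64 possible endpoints of the next token given the bytes so far."""

    def __init__(self, hidden=256):
        super().__init__()
        self.byte_emb = torch.nn.Embedding(256, hidden)
        self.gru = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, MAX_TOKEN_LEN)

    def forward(self, byte_ids):                      # (1, n_bytes_so_far)
        h, _ = self.gru(self.byte_emb(byte_ids))
        return self.head(h[:, -1])                    # (1, 64) logits over next-token length


def beam_search_tokenizations(predictor, data, beam_width=BEAM_WIDTH):
    """Return up to `beam_width` tokenizations of `data` as (token end positions, log-prob)."""
    active, finished = [([], 0.0)], []
    while active:
        candidates = []
        for ends, logp in active:
            pos = ends[-1] if ends else 0
            prefix = torch.tensor([[0] + list(data[:pos])])   # leading dummy byte so input is never empty
            log_probs = F.log_softmax(predictor(prefix), dim=-1)[0]
            for length in range(1, min(MAX_TOKEN_LEN, len(data) - pos) + 1):
                candidates.append((ends + [pos + length],
                                   logp + log_probs[length - 1].item()))
        candidates.sort(key=lambda c: -c[1])
        active = []
        for ends, logp in candidates[:beam_width]:
            (finished if ends[-1] == len(data) else active).append((ends, logp))
    return sorted(finished, key=lambda c: -c[1])[:beam_width]


def expected_loss(transformer_loss_fn, predictor, data):
    """Expected transformer loss over the beam, weighted by each tokenization's probability."""
    tokenizations = beam_search_tokenizations(predictor, data)
    weights = F.softmax(torch.tensor([lp for _, lp in tokenizations]), dim=0)
    losses = torch.stack([transformer_loss_fn(data, ends) for ends, _ in tokenizations])
    return (weights * losses).sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    predictor = BoundaryPredictor()
    data = b"hello world, this is a test of learned tokenization"
    # stand-in for the big transformer: pretend its loss is just the token count
    toy_loss = lambda _data, ends: torch.tensor(float(len(ends)))
    print(expected_loss(toy_loss, predictor, data))
```

The softmax over the beam's log-probs is what "expected value of the loss" cashes out to here: tokenizations the boundary model considers less likely contribute less to the final loss.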
If we train this on prompt-response pairs, so the model only has to learn what to say and doesn't have to predict the context, then it could learn to skim boring stretches by patching them into ~64-byte tokens, or longer ones if we want.
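The response-only loss could look something like this (again just a sketch; `per_token_loss`, `token_ends`, and `prompt_len` are assumed to come from the setup above):

```python
import torch


def response_only_loss(per_token_loss, token_ends, prompt_len):
    """Zero out the loss on tokens that end inside the prompt.

    per_token_loss: (n_tokens,) loss for predicting each token
    token_ends:     (n_tokens,) byte index where each token ends
    prompt_len:     number of prompt bytes at the start of the sequence
    """
    mask = (token_ends > prompt_len).float()   # only tokens reaching into the response count
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```

Since the prompt tokens carry no loss, nothing penalizes the boundary model for lumping them into big patches.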
And of course we'd use a short-context, byte-level transformer to encode/decode tokens to and from vectors. Admittedly this idea is still kinda half-baked.
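For the encode/decode piece, something in this shape is what I'm imagining (sizes, the mean-pooling, and the non-autoregressive decoder are all placeholders; a real decoder would probably be autoregressive over bytes):

```python
import torch
import torch.nn as nn

MAX_TOKEN_LEN = 64
D_MODEL = 512


class ByteTokenEncoder(nn.Module):
    """Short-context byte-level transformer that maps one token's bytes to a single vector."""

    def __init__(self, d_model=D_MODEL, n_layers=2, n_heads=8):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)
        self.pos_emb = nn.Embedding(MAX_TOKEN_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, byte_ids):               # (batch, token_len) with values in [0, 255]
        pos = torch.arange(byte_ids.size(1), device=byte_ids.device)
        h = self.encoder(self.byte_emb(byte_ids) + self.pos_emb(pos))
        return h.mean(dim=1)                    # (batch, d_model): one vector per token


class ByteTokenDecoder(nn.Module):
    """Maps a token vector back to byte logits for each position (autoregression elided)."""

    def __init__(self, d_model=D_MODEL):
        super().__init__()
        self.out = nn.Linear(d_model, MAX_TOKEN_LEN * 256)

    def forward(self, token_vec):               # (batch, d_model)
        return self.out(token_vec).view(-1, MAX_TOKEN_LEN, 256)
```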