Concretely: we learn a medium-sized model that takes a partial tokenization and outputs a probability distribution over the endpoints of the next token (say token lengths range from 1 to 64 bytes, so the model outputs 64 logits). Then we do a beam search to find the, say, 4 most likely tokenizations. Then we run the transformer on all four tokenizations and take the expected value of the loss, weighted by each tokenization's probability under the boundary model, as the final loss.
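To make the boundary-predictor / beam-search / expected-loss loop concrete, here's a toy sketch. The `BoundaryPredictor` is a tiny GRU stand-in for the medium-sized model, `transformer_loss_fn` is a placeholder for running the big transformer on one tokenization and returning its loss, and the mixture weights over the beam are treated as constants (I'm not worrying here about how gradients would reach the boundary model):

```python
import torch
import torch.nn.functional as F

MAX_TOKEN_LEN = 64   # token lengths range from 1 to 64 bytes
BEAM_WIDTH = 4       # keep the 4 most likely tokenizations


class BoundaryPredictor(torch.nn.Module):
    """Scores the 64 possible endpoints of the next token given the bytes so far."""

    def __init__(self, hidden=256):
        super().__init__()
        self.byte_emb = torch.nn.Embedding(256, hidden)
        self.gru = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, MAX_TOKEN_LEN)

    def forward(self, byte_ids):                      # (1, n_bytes_so_far)
        h, _ = self.gru(self.byte_emb(byte_ids))
        return self.head(h[:, -1])                    # (1, 64) logits over next-token length


def beam_search_tokenizations(predictor, data, beam_width=BEAM_WIDTH):
    """Return up to `beam_width` tokenizations of `data` as (token end positions, log-prob)."""
    active, finished = [([], 0.0)], []
    while active:
        candidates = []
        for ends, logp in active:
            pos = ends[-1] if ends else 0
            prefix = torch.tensor([[0] + list(data[:pos])])   # leading dummy byte so input is never empty
            log_probs = F.log_softmax(predictor(prefix), dim=-1)[0]
            for length in range(1, min(MAX_TOKEN_LEN, len(data) - pos) + 1):
                candidates.append((ends + [pos + length],
                                   logp + log_probs[length - 1].item()))
        candidates.sort(key=lambda c: -c[1])
        active = []
        for ends, logp in candidates[:beam_width]:
            (finished if ends[-1] == len(data) else active).append((ends, logp))
    return sorted(finished, key=lambda c: -c[1])[:beam_width]


def expected_loss(transformer_loss_fn, predictor, data):
    """Expected transformer loss over the beam, weighted by each tokenization's probability."""
    tokenizations = beam_search_tokenizations(predictor, data)
    weights = F.softmax(torch.tensor([lp for _, lp in tokenizations]), dim=0)
    losses = torch.stack([transformer_loss_fn(data, ends) for ends, _ in tokenizations])
    return (weights * losses).sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    predictor = BoundaryPredictor()
    data = b"hello world, this is a test of learned tokenization"
    # stand-in for the big transformer: pretend its loss is just the token count
    toy_loss = lambda _data, ends: torch.tensor(float(len(ends)))
    print(expected_loss(toy_loss, predictor, data))
```

The softmax over the beam's log-probs is what "expected value of the loss" cashes out to here: tokenizations the boundary model considers less likely contribute less to the final loss.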
If we train this on prompt-response pairs, so the model only has to learn what to say and doesn't have to predict the context, then it could learn to skim boring stretches by patching them into ~64-byte tokens, or longer ones if we want.
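The response-only loss could look something like this (again just a sketch; `per_token_loss`, `token_ends`, and `prompt_len` are assumed to come from the setup above):

```python
import torch


def response_only_loss(per_token_loss, token_ends, prompt_len):
    """Zero out the loss on tokens that end inside the prompt.

    per_token_loss: (n_tokens,) loss for predicting each token
    token_ends:     (n_tokens,) byte index where each token ends
    prompt_len:     number of prompt bytes at the start of the sequence
    """
    mask = (token_ends > prompt_len).float()   # only tokens reaching into the response count
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```

Since the prompt tokens carry no loss, nothing penalizes the boundary model for lumping them into big patches.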
And of course we'd use a short-context, byte-level transformer to encode/decode tokens to and from vectors. Admittedly this idea is still kinda half-baked.
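For the encode/decode piece, something in this shape is what I'm imagining (sizes, the mean-pooling, and the non-autoregressive decoder are all placeholders; a real decoder would probably be autoregressive over bytes):

```python
import torch
import torch.nn as nn

MAX_TOKEN_LEN = 64
D_MODEL = 512


class ByteTokenEncoder(nn.Module):
    """Short-context byte-level transformer that maps one token's bytes to a single vector."""

    def __init__(self, d_model=D_MODEL, n_layers=2, n_heads=8):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)
        self.pos_emb = nn.Embedding(MAX_TOKEN_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, byte_ids):               # (batch, token_len) with values in [0, 255]
        pos = torch.arange(byte_ids.size(1), device=byte_ids.device)
        h = self.encoder(self.byte_emb(byte_ids) + self.pos_emb(pos))
        return h.mean(dim=1)                    # (batch, d_model): one vector per token


class ByteTokenDecoder(nn.Module):
    """Maps a token vector back to byte logits for each position (autoregression elided)."""

    def __init__(self, d_model=D_MODEL):
        super().__init__()
        self.out = nn.Linear(d_model, MAX_TOKEN_LEN * 256)

    def forward(self, token_vec):               # (batch, d_model)
        return self.out(token_vec).view(-1, MAX_TOKEN_LEN, 256)
```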