I think it's unfortunately infeasible to train on bytes, but it also seems very wrong to use a handwritten, ultimately human-designed set of tokens (if you look at the tokenizers out there you'll find fun things like regular expressions that change what gets tokenized based on anecdotal evidence).
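To make the regex point concrete: GPT-2-style tokenizers run a hand-written pre-tokenization regex before BPE even starts. A simplified sketch (the real pattern uses Unicode property classes and special-cases English contractions like 's and 't, which is exactly the kind of anecdotal hand-tuning I mean; this stdlib-`re` version just splits words, numbers, punctuation, and whitespace the same general way):

```python
import re

# Simplified stand-in for a GPT-2-style pre-tokenization regex.
# Splits text into word / number / punctuation / whitespace chunks,
# keeping a single leading space attached to the chunk after it.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

def pretokenize(text):
    """Split text into the pre-token chunks that BPE would then merge within."""
    return PRETOKENIZE.findall(text)

print(pretokenize("Tokenizers aren't neutral, v2.0"))
```

Note how "aren't" gets chopped at the apostrophe here, while the real pattern carves off "'t" as its own chunk: both choices are hand-made decisions about English, baked in before the model sees anything.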
I keep thinking that if we can turn images into tokens, and audio into tokens, then surely we can create a set of tokens that are the model's own chosen representation of semantic (multimodal) meaning, and then decode those tokens back to text[1]. An obvious downside is that the model could no longer quote text it has seen verbatim, since the encoded tokens would need to be decoded back to text, and that decoding would be lossy.
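The "turn images/audio into tokens" step is typically done with vector quantization (VQ-VAE-style tokenizers): an encoder produces continuous vectors, each vector is snapped to its nearest entry in a learned codebook, and that entry's index is the token. A toy sketch of just that step (made-up 2-d codebook, not any real model's values), which also shows where the lossiness comes from:

```python
# Toy vector-quantization step behind VQ-VAE-style tokenizers.
# Each continuous encoder output is replaced by the index of its
# nearest codebook vector; that index is the "token".
CODEBOOK = [
    (0.0, 0.0),  # token 0
    (1.0, 0.0),  # token 1
    (0.0, 1.0),  # token 2
    (1.0, 1.0),  # token 3
]

def quantize(vec):
    """Return the codebook index (token id) nearest to vec."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in CODEBOOK]
    return dists.index(min(dists))

def detokenize(token):
    """Decoding only recovers the codebook vector, not the original input."""
    return CODEBOOK[token]

tokens = [quantize(v) for v in [(0.1, -0.2), (0.9, 1.1)]]
```

Here (0.1, -0.2) and (0.9, 1.1) come back as (0.0, 0.0) and (1.0, 1.0): the round trip keeps the gist but drops the exact values, which is the same reason a semantic-token model couldn't quote its training text 1:1.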
[1] From what I can gather, this is exactly what OpenAI did with images in their GPT-4o announcement; see "Explorations of capabilities": https://openai.com/index/hello-gpt-4o/