Human audio perception is based on detecting frequency components, which the inner ear extracts via what amounts to a filter bank (hairs of different lengths resonating at different frequencies).
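For a rough feel of what that filter bank looks like in software, here's a minimal sketch (the filename is a placeholder, and it assumes librosa is available) that computes a mel filter bank energy per channel per frame, the usual crude stand-in for the cochlea's many resonant channels:

```python
import librosa

# Load ~3 seconds of speech at 16 kHz (path is a placeholder).
y, sr = librosa.load("speech.wav", sr=16000, duration=3.0)

# A bank of overlapping triangular filters on a mel-spaced frequency axis:
# each output channel is the energy picked up by one "resonator".
mel_spec = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=512, hop_length=160, n_mels=40
)
log_mel = librosa.power_to_db(mel_spec)
print(log_mel.shape)  # (40 mel bands, ~300 frames): one energy-per-channel vector every 10 ms
```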
Speech perception builds on this frequency representation and is based on "formants" - the frequency bands emphasized by the vocal tract resonances that articulation creates when the speech is generated. More specifically, most speech information is carried by formant changes, since these correspond to articulatory changes. There are also other articulatory artifacts in speech, such as the onsets of speech energy corresponding to plosives ("puh", "buh") and the high frequencies generated by fricatives like "sss".
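One standard way to make those resonances explicit is LPC analysis: fit an all-pole filter to the waveform and read formant estimates off its pole angles. A rough sketch, assuming librosa, a short steady-vowel clip (placeholder filename) and the usual model-order rule of thumb:

```python
import numpy as np
import librosa

# ~50 ms of a steady vowel at 16 kHz.
y, sr = librosa.load("vowel.wav", sr=16000, duration=0.05)

# Fit an all-pole (LPC) model; its resonances approximate the vocal tract's.
order = 2 + sr // 1000          # common rule of thumb: ~2 + sample rate in kHz
a = librosa.lpc(y, order=order)

# Formant estimates come from the angles of the poles (roots of A(z)).
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]            # keep one of each conjugate pair
freqs = np.angle(roots) * sr / (2 * np.pi)   # convert pole angle to Hz
freqs = freqs[freqs > 90]                    # drop near-DC poles
print(sorted(freqs)[:3])                     # rough F1, F2, F3
```

In practice you would also discard poles with very wide bandwidths, but this is enough to watch F1/F2/F3 move as articulation changes.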
One problem with embedding MP3 frames as audio tokens is that although MP3 compression is based on a frequency representation, you then have quantization, Huffman coding and the MP3 frame structure layered on top, so the frame as a whole is more of a black box. Presumably a transformer could still learn to predict a text transcription from MP3 frames, or from any arbitrary encoding of speech audio for that matter (similar to how an LLM can predict text from a Base64 representation, or vice versa), but it certainly isn't making the job easier if the input obfuscates the frequency components, formants, etc. that correspond to the generating process.
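To make the contrast concrete, here's a sketch of the two framings (filenames are placeholders, and decoding the MP3 assumes librosa has a backend that can read it): raw codec bytes as opaque tokens versus decoding back to PCM and handing the model log-mel features, which is roughly what a frontend like Whisper's does.

```python
import numpy as np
import librosa

# Option A: feed the raw compressed bitstream as opaque byte tokens.
with open("speech.mp3", "rb") as f:
    mp3_bytes = np.frombuffer(f.read(), dtype=np.uint8)
# Quantization, Huffman coding and frame headers are all entangled here;
# the model would have to learn to undo the codec before "seeing" frequencies.

# Option B: decode to PCM first and expose the spectrum explicitly.
y, sr = librosa.load("speech.mp3", sr=16000)   # decoding depends on the installed backend
log_mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
)
# log_mel lays out energy per frequency band per 10 ms frame, so formant
# structure and plosive onsets are directly inspectable features.
```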
Not having direct access to the frequency/formant information is also going to make generalization harder, since generalization in speech is based around formant structure and formant changes. When articulating the same word, the specific formant frequencies differ between individuals, primarily due to vocal tract length, but humans have no problem generalizing across these differences and understanding speech from different speakers. I'm not sure an LLM trained only to predict MP3 speech from, say, adult males would necessarily have generalized enough to also recognize child speech, or speech from a synthesizer.
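The scale of those individual differences is easy to ballpark with the uniform-tube (quarter-wave resonator) approximation of the vocal tract, F_n = (2n - 1) * c / (4L); the tract lengths below are just typical illustrative figures:

```python
# Uniform-tube approximation: formants of a tube closed at one end.
C = 34300.0  # speed of sound in cm/s

def formants(tract_length_cm, n=3):
    """First n resonant frequencies (Hz) of a tube of the given length."""
    return [(2 * k - 1) * C / (4 * tract_length_cm) for k in range(1, n + 1)]

print(formants(17.5))  # adult male,  ~17.5 cm -> ~[490, 1470, 2450] Hz
print(formants(14.5))  # adult female, ~14.5 cm -> ~[590, 1780, 2960] Hz
print(formants(10.0))  # young child,  ~10 cm   -> ~[860, 2570, 4290] Hz
```

Same articulation, yet the child's formants sit hundreds of hertz to over a kilohertz higher than the adult male's, which is exactly the kind of systematic shift a model buried under codec framing would have to discover on its own.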