Obviously, working directly with audio is vastly more complex than with text. But it is exciting to see that part of making LLMs work natively with speech is finding a codec that is maximally efficient at encoding speech.
I even have to wonder if, at some point, we end up creating a popular voice codec for LLMs based not on the Fourier transform or similar, but rather on some set of physical parameters describing vocal cord shape, tongue position, throat/chest/mouth shape, etc.
I can imagine such a model being arrived at statistically (to determine the necessary number of parameters) and then almost becoming "hard-coded" as a standard, since human anatomy only varies within certain ranges.
I think it's called formant speech encoding, and it would be interesting if LLMs wind up massively advancing that field, since historically it has had more to do with speech synthesis than with audio compression.
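
To make the idea concrete, here's a toy sketch of the source-filter view behind formant synthesis: a single pitch value stands in for the vocal cords, and a few resonators stand in for the vocal tract. The formant frequencies/bandwidths are rough textbook-ish values for an "ah" vowel, purely illustrative, and this is nothing like a real codec:

    # Toy formant synthesis: a vowel described by a handful of
    # physical-ish parameters instead of raw waveform samples.
    # All numeric values here are illustrative, not from any standard.
    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                  # sample rate (Hz)
    f0 = 120                    # glottal pitch (Hz) -- the "vocal cord" parameter
    # (frequency, bandwidth) in Hz for each formant, roughly an "ah" vowel
    formants = [(730, 90), (1090, 110), (2440, 170)]

    # Source: a simple impulse train standing in for glottal pulses.
    n = np.arange(int(0.5 * fs))            # half a second of audio
    source = np.zeros_like(n, dtype=float)
    source[::fs // f0] = 1.0

    # Filter: a cascade of two-pole resonators, one per formant
    # (a crude stand-in for the vocal tract's resonances).
    signal = source
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)             # pole radius from bandwidth
        theta = 2 * np.pi * freq / fs            # pole angle from formant frequency
        a = [1.0, -2 * r * np.cos(theta), r**2]  # resonator denominator
        b = [1.0 - r]                            # rough gain scaling
        signal = lfilter(b, a, signal)

    signal /= np.max(np.abs(signal))  # normalize; write to a WAV file to listen

The appeal for compression is obvious even in this toy: the whole half second of audio is described by seven numbers (one pitch plus three frequency/bandwidth pairs), which is exactly the kind of compact parameterization you could imagine an LLM emitting as tokens.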