
425 points | karimf | 1 comment
daxfohl:
I wonder if a recurrent model with constant cost per token, like RWKV or S4, would work better here. For audio, I wouldn't think you'd need long-range context, and all-to-all attention seems like overkill.
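
To make the per-frame cost concrete, here's a minimal toy sketch (my own illustration, not actual RWKV or S4 code; all names and sizes are made up) of a diagonal linear state-space recurrence in PyTorch:

```python
import torch

class DiagonalSSM(torch.nn.Module):
    """Toy diagonal state-space layer: O(1) work and memory per audio frame."""
    def __init__(self, d_in: int, d_state: int):
        super().__init__()
        self.A = torch.nn.Parameter(0.9 * torch.rand(d_state))        # per-dimension decay
        self.B = torch.nn.Parameter(0.01 * torch.randn(d_state, d_in))
        self.C = torch.nn.Parameter(0.01 * torch.randn(d_in, d_state))

    def step(self, x_t: torch.Tensor, h: torch.Tensor):
        # x_t: (batch, d_in) one audio frame; h: (batch, d_state) carried state.
        h = self.A * h + x_t @ self.B.T   # constant cost, independent of history length
        y_t = h @ self.C.T                # output for this frame
        return y_t, h
```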

Maybe a transformer could run in parallel but at a much lower frequency, with the linear model feeding it "summary" tokens once per second, carrying mostly "text" but also some hint of emotion and other cues. Then the output of this could be fed back to the linear model so that it would know what it was saying and with what emotion. Basically the transformer would be the low-frequency, long-range-context thinker (and feeler), and the linear model would translate that to and from phonetics.
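
A rough sketch of that two-timescale idea, under my own assumptions (a GRU cell standing in for the RWKV/S4-style fast model, a summary every 50 frames standing in for "once per second", and made-up sizes):

```python
import torch

class TwoTimescaleSpeechModel(torch.nn.Module):
    def __init__(self, d_frame: int = 128, d_model: int = 512, summary_every: int = 50):
        super().__init__()
        self.summary_every = summary_every
        # Fast path: recurrent model at audio-frame rate (stand-in for RWKV/S4).
        self.fast = torch.nn.GRUCell(d_frame + d_model, d_model)
        self.to_summary = torch.nn.Linear(d_model, d_model)
        # Slow path: transformer over the sparse "summary" tokens.
        self.slow = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        self.to_frame = torch.nn.Linear(d_model, d_frame)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, d_frame) acoustic features.
        B, T, _ = frames.shape
        h = frames.new_zeros(B, self.fast.hidden_size)
        context = frames.new_zeros(B, self.fast.hidden_size)  # latest slow-path output
        summaries, outputs = [], []
        for t in range(T):
            # Fast model sees the current frame plus the transformer's context.
            h = self.fast(torch.cat([frames[:, t], context], dim=-1), h)
            outputs.append(self.to_frame(h))
            if (t + 1) % self.summary_every == 0:
                # Hand a summary token up, then feed long-range context back down.
                summaries.append(self.to_summary(h))
                context = self.slow(torch.stack(summaries, dim=1))[:, -1]
        return torch.stack(outputs, dim=1)
```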

They'd be trained in parallel, so those transformer tokens would acquire meaning at training time rather than being pre-defined. So it'd still be purely phonetic end-to-end, with no direct translation to text. It could even end up being a good way to compress text for LLMs, since low-value words might get smaller representations in the tokens.
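
Training could then be a single end-to-end objective on audio alone, something like this (reusing the sketch above; the next-frame MSE loss and `dataloader` are placeholder assumptions):

```python
import torch

model = TwoTimescaleSpeechModel()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for frames in dataloader:            # (batch, T, d_frame) audio features, no text labels
    pred = model(frames[:, :-1])     # predict each next frame from the past
    loss = torch.nn.functional.mse_loss(pred, frames[:, 1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
# The summary tokens pick up whatever meaning helps this purely acoustic objective,
# so nothing about them is pre-defined as "text".
```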

It probably would never reach the level of text-based LLMs for logic and code and such, but that somewhat parallels humans anyway; it's pretty hard to explain an algorithm in detail in plain conversation.

ketothekingdom:
Cartesia is doing constant-time models for audio: https://cartesia.ai/