425 points karimf | 1 comments
daxfohl No.45657955
I wonder if a linear-time, constant-memory model like RWKV or S4 would work better here. For audio, I wouldn't think you'd need long-range context, and all-to-all mapping seems like overkill.
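
To make the memory argument concrete, here is a toy sketch. It is not RWKV's or S4's actual parameterization: the scalar `decay` recurrence and both function names are made up for illustration. The point is only that a recurrent model carries a fixed-size state from frame to frame, while a causal attention layer has to keep one cache entry per past frame.

```python
# Toy comparison (hypothetical names): fixed-size recurrent state
# versus an attention cache that grows with the sequence.

def linear_recurrence(inputs, decay=0.9):
    """Scalar recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    state = 0.0  # one number, no matter how long the input is
    outputs = []
    for x in inputs:
        state = decay * state + (1.0 - decay) * x
        outputs.append(state)
    return outputs


def attention_cache_sizes(num_steps):
    """Number of past entries a causal attention layer holds at each step."""
    return [t + 1 for t in range(num_steps)]


frames = [1.0] * 5
outs = linear_recurrence(frames)
cache = attention_cache_sizes(len(frames))
```

Per-frame work for the recurrence stays constant, whereas the attention cache in `cache` grows by one entry per frame.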

Maybe a transformer could run in parallel, but at a much lower frequency, where the linear model feeds it "summary" tokens once per second. Their information would mostly be "text", but would also carry some hint of emotion and other cues. The transformer's output could then be fed back to the linear model so that it would know what it was saying and with what emotion. Basically, the transformer would be the low-frequency, long-range context thinker (and feeler), and the linear model would translate that to and from phonetics.
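
Roughly, the control flow I have in mind looks like the sketch below. Everything in it is a stand-in: the fast model is a scalar running average, the "transformer" is just a mean over all summaries seen so far, and the 50 frames per second rate is an assumption picked for the example.

```python
# Hypothetical two-rate loop: a fast per-frame model plus a slow model
# that only runs once per summary interval.

def run_two_rate(frames, frames_per_summary):
    fast_state = 0.0    # constant-size state of the fast (linear) model
    slow_context = 0.0  # latest output of the slow (transformer) model
    summaries = []      # everything the slow model has seen so far
    outputs = []
    for i, frame in enumerate(frames, start=1):
        # Fast step: update the state, conditioned on the slow context.
        fast_state = 0.5 * fast_state + 0.5 * frame
        outputs.append(fast_state + slow_context)
        if i % frames_per_summary == 0:
            # Slow step: emit a "summary" token, then attend over all
            # summaries (stand-in: their mean) and feed the result back.
            summaries.append(fast_state)
            slow_context = sum(summaries) / len(summaries)
    return outputs, summaries


# Two seconds of audio at an assumed 50 frames per second.
outputs, summaries = run_two_rate([1.0] * 100, frames_per_summary=50)
```

The slow model here sees 2 tokens for 100 frames, which is where the savings from running it at a lower frequency would come from.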

They'd be trained in parallel, so those transformer tokens would attain meaning at training time rather than having to be pre-defined. So it'd still be purely phonetic end to end, with no direct translation to text. It could even end up being a good way to compress text for LLMs, since low-value words might have a smaller representation in the token.

It probably would never reach the level of text-based LLMs for logic, code, and such, but that somewhat parallels humans anyway; it's pretty hard to explain an algorithm in detail in plain conversation.

replies(3): >>45658525 #>>45660560 #>>45663963 #
1. vvolhejn No.45660560
I don't know about linear models, but this kind of hierarchical modelling is quite a common idea in speech research. For example, OpenAI's Jukebox (2020) [1], which uses a proto-neural audio codec, has three levels of encoding that get coarser and coarser. They use a language model to predict continuations in the coarsest level and then have models to upscale to the finer levels and finally back to audio.
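
To get a feel for how much coarser each level is, here is a quick back-of-the-envelope calculation. It assumes the 8x, 32x and 128x compression factors and the 44.1 kHz sample rate as I recall them from the Jukebox paper, so treat the numbers as approximate and check [1] before relying on them.

```python
# Tokens per second at each level, assuming 44.1 kHz audio and
# downsampling factors of 8x, 32x and 128x (recalled, not verified here).

SAMPLE_RATE_HZ = 44_100
DOWNSAMPLING_FACTORS = [8, 32, 128]  # finest to coarsest level

tokens_per_second = {
    factor: SAMPLE_RATE_HZ / factor for factor in DOWNSAMPLING_FACTORS
}

for factor, rate in tokens_per_second.items():
    print(f"{factor:>3}x compression: {rate:.1f} tokens/s")
```

Under those assumptions the language model at the coarsest level handles a few hundred tokens per second instead of tens of thousands of raw samples.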

The recent MiMo-audio bunches tokens into "patches" of four timesteps and has the model predict those. [2]
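
As a generic illustration of patching (not MiMo-Audio's actual code; the `to_patches` helper and the padding choice are my own assumptions), grouping a token sequence into patches of four timesteps could look like this:

```python
# Hypothetical helper: group a token sequence into fixed-size patches.

def to_patches(tokens, patch_size=4, pad=None):
    """Split tokens into consecutive patches, padding the last one."""
    tokens = list(tokens)
    remainder = len(tokens) % patch_size
    if remainder:
        tokens += [pad] * (patch_size - remainder)
    return [tokens[i:i + patch_size] for i in range(0, len(tokens), patch_size)]


patches = to_patches(range(8))
ragged = to_patches([1, 2, 3, 4, 5])
```

The model then makes one prediction per patch, so the sequence it operates on is four times shorter than the raw token stream.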

[1] https://arxiv.org/abs/2005.00341

[2] https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audi...