An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.
replies(5):
That's why residual vector quantization is a useful technique - multiple codebooks quantize a single timeslice, with each successive codebook quantizing the residual left over by the previous stage. You can also quantize the signal at different frequencies.
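A rough sketch of the idea in numpy (the random codebooks here just stand in for learned ones; real codecs train them, so treat this as illustrative only):

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Each stage quantizes the residual left over by the previous stage."""
    residual = frame.copy()
    indices = []
    for cb in codebooks:                       # cb: (num_codes, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)                    # nearest code at this stage
        residual -= cb[idx]                    # hand the remainder onward
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is just the sum of the chosen codes."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Toy demo: 3 codebooks of 256 entries over an 8-dim frame embedding,
# so one timeslice becomes a stack of 3 discrete tokens.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(3)]
frame = rng.normal(size=8)

codes = rvq_encode(frame, codebooks)
print(codes, np.linalg.norm(frame - rvq_decode(codes, codebooks)))
# With trained codebooks, each extra stage shrinks the reconstruction error.
```

The payoff for LLM training is that each audio frame becomes a short stack of discrete tokens rather than raw samples, which is exactly the kind of sequence a language model can consume.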
There are samples toward the end of the post from their LLM trained on their Mimi audio codec.
https://openai.com/index/whisper/
This approach dates back to the 1940s, when people were trained to read speech from spectrograms. The 1947 book "Visible Speech" by Potter, Kopp, and Green describes these experiments. Here is a slightly more recent review of the subject, from 1988: "Formalizing Knowledge Used in Spectrogram Reading"