An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.
replies(5):
That's why residual vector quantization is a useful technique - multiple codebooks quantize a single timeslice, with each successive codebook quantizing the residual left over by the previous stage. You can also quantize the signal at different frequencies.
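A rough sketch of the idea in numpy (the random codebooks here just stand in for learned ones; real codecs train them, so treat this as illustrative only):

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Each stage quantizes the residual left over by the previous stage."""
    residual = frame.copy()
    indices = []
    for cb in codebooks:                       # cb: (num_codes, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)                    # nearest code at this stage
        residual -= cb[idx]                    # hand the remainder onward
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is just the sum of the chosen codes."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Toy demo: 3 codebooks of 256 entries over an 8-dim frame embedding,
# so one timeslice becomes a stack of 3 discrete tokens.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(3)]
frame = rng.normal(size=8)

codes = rvq_encode(frame, codebooks)
print(codes, np.linalg.norm(frame - rvq_decode(codes, codebooks)))
# With trained codebooks, each extra stage shrinks the reconstruction error.
```

The payoff for LLM training is that each audio frame becomes a short stack of discrete tokens rather than raw samples, which is exactly the kind of sequence a language model can consume.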
There are samples toward the end of the post from their LLM trained on their Mimi audio codec.
https://openai.com/index/whisper/
This approach dates back to the 1940s, when people were trained to read speech from spectrograms. The 1947 book "Visible Speech" by Potter, Kopp, and Green describes these experiments. Here is a slightly more recent review of the subject, from 1988: "Formalizing Knowledge Used in Spectrogram Reading"