trollbridge
An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.
nmfisher
The article is talking about doing exactly that. The key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens. A single window of audio is usually somewhere between 10ms and 100ms. It's difficult to squish all that information down to a single "token" that represents the semantic and acoustic content for that window.

That's why residual vector quantization is a useful technique: you use multiple codebooks to quantize a single timeslice, with each level quantizing the residual left over from the previous one. You can also quantize a signal at different frequencies.
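
To make that concrete, here's a minimal numpy sketch of residual vector quantization on a single frame embedding. The codebooks are random placeholders rather than trained ones, and the dimensions are made up for illustration:

    import numpy as np

    def rvq_encode(frame, codebooks):
        # Each codebook quantizes whatever the previous levels failed to
        # capture, so later levels refine the reconstruction.
        residual = frame
        codes = []
        for cb in codebooks:                      # cb: (codebook_size, dim)
            idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
            codes.append(idx)
            residual = residual - cb[idx]         # pass the leftover downward
        return codes

    def rvq_decode(codes, codebooks):
        # Reconstruction is just the sum of the chosen codeword at each level.
        return sum(cb[i] for i, cb in zip(codes, codebooks))

    rng = np.random.default_rng(0)
    # One 64-dim frame (think: the embedding of one ~80ms window), 4 levels
    # of 256 entries each -> 4 x 8 = 32 bits for the whole timeslice.
    codebooks = [rng.standard_normal((256, 64)) * 0.5 ** k for k in range(4)]
    frame = rng.standard_normal(64)
    codes = rvq_encode(frame, codebooks)
    print(codes, np.linalg.norm(frame - rvq_decode(codes, codebooks)))

Trained codecs like Encodec and Mimi learn the codebooks jointly with an encoder/decoder network, but the residual chaining is the same idea.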

There are samples towards the end of the post from their LLM trained on their Mimi audio codec.

duped
> the key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens

Did Claude Shannon not answer this question in 1948? You need at least 1 bit per 6 dB of dynamic range for each symbol, and 2B symbols per second, where B is the bandwidth of the signal.

Compression techniques are all about getting below that fundamental limit, but it's not like this is an unsolved problem. Or is 1 kbaud too much for LLMs?
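
Plugging that rule of thumb into plain PCM gives a sense of the raw rates involved (the bandwidth and dynamic-range numbers below are just illustrative):

    # Back-of-the-envelope PCM rates from the rule of thumb above:
    # ~6 dB of dynamic range per bit, and a Nyquist rate of 2B samples/s.
    def pcm_bitrate(bandwidth_hz, dynamic_range_db):
        bits_per_sample = dynamic_range_db / 6
        samples_per_sec = 2 * bandwidth_hz
        return bits_per_sample * samples_per_sec

    print(pcm_bitrate(4_000, 48))   # telephone-band speech: 64,000 bit/s
    print(pcm_bitrate(8_000, 96))   # 16-bit wideband audio: 256,000 bit/s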

nmfisher
Yes, quantization isn't anything new, nor are audio codecs. As you point out, though, it's not just about designing a quantization scheme to reconstruct an analog signal. The scheme itself needs to be "easy" for current model architectures to learn and decode autoregressively (ideally, in realtime on standard hardware).

The blog post addresses this directly with samples from their own baseline (an autoregressive mu-law vocoder) and from WaveNet (which used a similar architecture). The sound is mostly recognizable as a human voice, but it's unintelligible. The sequence length is too long and the SNR of the encoding scheme is too low for a generative/autoregressive model to learn.
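
For reference, here's a minimal mu-law companding sketch (not the post's actual baseline code) that shows where the sequence-length problem comes from:

    import numpy as np

    def mu_law_encode(x, mu=255):
        # Compand a [-1, 1] waveform and quantize it into 256 discrete
        # tokens -- the sample-level vocabulary WaveNet-style models use.
        companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
        return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)  # 0..255

    def mu_law_decode(tokens, mu=255):
        companded = 2 * tokens.astype(np.float64) / mu - 1
        return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

    sr = 16_000                                    # 16 kHz sample rate
    t = np.arange(sr) / sr
    tokens = mu_law_encode(0.5 * np.sin(2 * np.pi * 220 * t))
    print(tokens.shape)   # (16000,) -- 16,000 tokens for ONE second of audio

At that rate an autoregressive model has to predict 16,000 8-bit tokens per second, which is the sequence-length problem described above.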

This is what the neural codec is intended to address. Decoupling semantic from acoustic modelling is an important step ("how our ears interpret a sound" vs. "what we need to reconstruct the exact acoustic signal"). Mimi works at 1.1 kbps, and others work at similarly low bitrates (Descript, SemantiCodec, etc.). Encodec runs at a higher bitrate, so it generally delivers better audio quality.
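
For a sense of where a figure like 1.1 kbps comes from, here's the arithmetic for an RVQ codec; the frame rate and codebook sizes below match Mimi's published description, but treat them as assumptions rather than something taken from the post:

    import math

    frame_rate_hz = 12.5     # assumed: frames per second
    num_codebooks = 8        # assumed: RVQ levels kept per frame
    codebook_size = 2048     # assumed: entries per codebook
    bits_per_code = math.log2(codebook_size)                # 11 bits
    print(frame_rate_hz * num_codebooks * bits_per_code)    # 1100.0 bit/s

That's far below the PCM rates sketched above, and only 100 discrete codes per second of audio for a model to predict.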

Now, why are neural codecs easier to model than conventional parametric codecs? I don't know. Maybe they're not; maybe it's just an artifact of the transformer architecture (since semantic tokens are generally extracted from self-supervised models like WavLM). It's definitely an interesting question.