
425 points karimf | 5 comments
1. crazygringo No.45656005
This is fascinating.

Obviously working directly with audio is vastly more complex than with text.

But it is very exciting to see that part of making LLMs work natively with speech is finding a codec that is maximally efficient at encoding speech.

I even have to wonder if, at some point, we'll ultimately create a popular voice codec usable with LLMs based not on the Fourier transform or similar, but rather on a set of physical parameters describing vocal cord shape, tongue position, throat/chest/mouth shape, etc.

I can imagine such a model being arrived at statistically (to determine the necessary number of parameters), and then almost becoming "hard-coded" as a standard, since human anatomy doesn't vary there beyond certain ranges.

I think this is called formant speech encoding, and it would be interesting if LLMs wind up massively advancing that field, since historically I think it has had more to do with speech synthesis than with audio compression.
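
To give a sense of what I mean, a classic formant synthesizer boils a voice down to a handful of physically meaningful numbers. A rough numpy sketch (the pitch, formant frequencies, and bandwidths below are illustrative guesses for an /a/-like vowel, not values from the article or any real standard):

    # Minimal formant-style synthesizer sketch (illustrative only).
    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                     # sample rate in Hz
    f0 = 120                       # glottal pitch in Hz (assumed)
    duration = 0.5                 # seconds

    # Source: an impulse train approximating glottal pulses.
    n = int(fs * duration)
    source = np.zeros(n)
    source[::fs // f0] = 1.0

    # Filter: cascade of second-order resonators, one per formant.
    # Each formant is just (center frequency, bandwidth) in Hz -- a few
    # physically meaningful parameters instead of raw waveform samples.
    formants = [(800, 80), (1150, 90), (2900, 120)]   # assumed values

    signal = source
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)           # pole radius from bandwidth
        theta = 2 * np.pi * freq / fs          # pole angle from frequency
        a = [1.0, -2 * r * np.cos(theta), r ** 2]
        signal = lfilter([1.0], a, signal)

    signal /= np.max(np.abs(signal))           # normalize for playback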

2. quinndupont No.45656334
There’s a long history of attempts at artificial speech that take this approach, recreating mouth parts and vibrating air. They are all pretty silly, like this work, which fails to understand how writing isn’t just a derivative of speech.
3. crazygringo No.45656743
> They are all pretty silly,

Huh? How?

> like this work which fails to understand how writing isn’t just a derivative of speech.

The whole point of the article is that writing isn't just a derivative of speech. It's in the introduction.

4. duped No.45657049
In speech coding/synthesis this is called a "source-filter" model (it decomposes speech production into a sound source at the vocal folds and a filter in the vocal tract, and parameterizes them), and it's actually older than Tukey and Cooley's rediscovery of the FFT.
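
A bare-bones illustration of the decomposition via LPC, one classic realization of the source-filter idea. This is only a sketch (the random frame stands in for a real 25 ms speech frame), not how production speech coders work:

    # Source-filter sketch via LPC analysis/resynthesis (illustrative only).
    import numpy as np
    from scipy.signal import lfilter
    from scipy.linalg import solve_toeplitz

    def lpc_coeffs(frame, order=16):
        """Estimate an all-pole vocal-tract filter from one speech frame
        using the autocorrelation method."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = solve_toeplitz(r[:order], r[1:order + 1])
        return np.concatenate(([1.0], -a))   # denominator [1, -a1, ..., -ap]

    fs = 16000
    frame = np.random.randn(400)             # stand-in for a 25 ms speech frame

    # "Filter": the vocal tract, summarized by ~16 LPC coefficients.
    a = lpc_coeffs(frame, order=16)

    # "Source": the residual left after inverse-filtering,
    # roughly corresponding to the glottal excitation.
    residual = lfilter(a, [1.0], frame)

    # Resynthesis: drive the estimated filter with the source again;
    # this reconstructs the original frame up to numerical error.
    resynth = lfilter([1.0], a, residual)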
5. vvolhejn No.45660804
Author here, thanks for the kind words! I think such a physics-based codec is unlikely to happen: in general, machine learning is always moving from handcrafted domain-specific assumptions to leaving as much as possible to the model. The more assumptions you bake in, the smaller the space of sounds you can model, so the quality is capped. Basically, modern ML is just about putting the right data into transformers.

That being said, having a more constrained model can also lead to some really cool stuff. The DDSP paper learns how to control a synthesizer to mimic instruments: https://arxiv.org/abs/2001.04643

You could probably do something similar for a speech model. The result would not sound as good, but you could get away with far fewer parameters, because much of the modelling work is done by the assumptions you put in.
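
To make that concrete, here's a toy sketch of a DDSP-style decoder: the learned model would only output a few slowly varying controls (a pitch contour and per-harmonic amplitudes, random placeholders below), and a fixed additive oscillator bank renders the audio. This is my own illustration, not code from the DDSP paper:

    import numpy as np

    fs = 16000
    n_frames, hop = 100, 160                    # 100 control frames, 10 ms apart
    n_harmonics = 20

    # Pretend these controls came from a small learned model.
    f0 = np.linspace(110, 140, n_frames)                        # pitch contour (Hz)
    amps = np.random.rand(n_frames, n_harmonics) / n_harmonics  # harmonic amplitudes

    # Upsample the controls to audio rate and run an additive oscillator bank.
    n = n_frames * hop
    frame_times = np.arange(n_frames) * hop
    f0_up = np.interp(np.arange(n), frame_times, f0)
    amps_up = np.stack([np.interp(np.arange(n), frame_times, amps[:, k])
                        for k in range(n_harmonics)], axis=1)

    phase = 2 * np.pi * np.cumsum(f0_up) / fs                   # instantaneous phase
    harmonic_idx = np.arange(1, n_harmonics + 1)
    audio = np.sum(amps_up * np.sin(np.outer(phase, harmonic_idx)), axis=1)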

Compare also KokoroTTS, a TTS model that's so tiny because it uses a handcrafted system to turn text into phonemes and then just synthesizes from those phonemes: https://huggingface.co/spaces/hexgrad/Kokoro-TTS
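
The split looks roughly like this; the lexicon and function names are made up for illustration and are not Kokoro's actual code or API:

    # Toy illustration of a two-stage TTS: handcrafted text-to-phoneme rules,
    # then a (small) learned phoneme-to-audio model. Hypothetical names only.
    TOY_LEXICON = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
    }

    def text_to_phonemes(text: str) -> list[str]:
        """Handcrafted lexicon/rules: cheap, no parameters to learn."""
        phones = []
        for word in text.lower().split():
            phones.extend(TOY_LEXICON.get(word, ["<unk>"]))
        return phones

    def phonemes_to_audio(phones: list[str]):
        """This is where the small neural acoustic model would live."""
        raise NotImplementedError("acoustic model goes here")

    print(text_to_phonemes("hello world"))
    # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']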