
425 points | karimf | 1 comment
bob1029:
Why not use normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to tokenize spatial and time-domain signals for these kinds of models?

Each MP3 frame is almost entirely self-contained and can reconstruct a few tens of milliseconds of the original audio without reference to other frames. I think this is the most important element. At 128 kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of audio, a 10-11x reduction over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.
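For anyone who wants to sanity-check those numbers, here's the arithmetic as a quick Python sketch (assuming MPEG-1 Layer III at 44.1 kHz, with 16-bit stereo PCM as the baseline):

    # Frame math for 128 kbps CBR MP3 (MPEG-1 Layer III, 44.1 kHz assumed;
    # padding makes real frames alternate between 417 and 418 bytes).
    bitrate = 128_000          # bits per second
    sample_rate = 44_100       # Hz
    samples_per_frame = 1_152  # fixed for MPEG-1 Layer III

    frame_bytes = 144 * bitrate / sample_rate           # ~417.96 bytes
    frame_ms = 1_000 * samples_per_frame / sample_rate  # ~26.12 ms

    # The same ~26 ms window as raw 16-bit stereo PCM:
    pcm_bytes = samples_per_frame * 2 * 2               # 4,608 bytes
    print(pcm_bytes / frame_bytes)                      # ~11.0x reduction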

I don't know if it's possible to use ~400-byte tokens in a transformer model, but I would be very tempted to try.
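Roughly what I have in mind, as a naive sketch (the frame size, model width, and byte scaling are placeholders of mine, not anything from TFA):

    # One MP3 frame = one transformer token, via a learned linear
    # projection of the raw frame bytes. All sizes are placeholders.
    import torch
    import torch.nn as nn

    FRAME_BYTES = 418   # ~one 128 kbps CBR frame
    D_MODEL = 512       # assumed transformer width

    frame_embed = nn.Linear(FRAME_BYTES, D_MODEL)

    # 100 frames ~= 2.6 s of audio; bytes scaled to [0, 1]
    frames = torch.randint(0, 256, (1, 100, FRAME_BYTES)).float() / 255
    tokens = frame_embed(frames)  # shape (1, 100, 512), ready for attention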

PaulDavisThe1st:
The approach in TFA encodes into a 32-dimensional space. I suspect this is significantly more dimensions than any psychoacoustic compression algorithm uses. Also, throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or more generally, audio) synthesis from scratch.
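For concreteness, the kind of encoder I mean looks roughly like this sketch (the strides and widths are made up; only the 32-dim latent matches TFA):

    # Neural-codec-style encoder: strided 1-D convolutions mapping raw
    # waveform to one 32-dim latent vector per hop. Illustrative only.
    import torch
    import torch.nn as nn

    encoder = nn.Sequential(
        nn.Conv1d(1, 64, kernel_size=7, stride=2, padding=3),
        nn.GELU(),
        nn.Conv1d(64, 128, kernel_size=7, stride=4, padding=3),
        nn.GELU(),
        nn.Conv1d(128, 32, kernel_size=7, stride=4, padding=3),
    )

    wav = torch.randn(1, 1, 16_000)  # 1 s of audio at 16 kHz
    latents = encoder(wav)           # (1, 32, 500): a 32-dim vector every 2 ms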
bob1029:
> throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or more generally, audio) synthesis from scratch.

I'm not sure I follow. If there is a set of tokens that the average human cannot perceive, why wouldn't we want to eliminate them from the search space? Who is the target audience for this model?

542354234235:
I imagine it's like having a Rosetta Stone: the same text written in a language you can read and a language you can't. For your own purposes, discarding the text you can't read would be fine; you wouldn't lose anything. But if you were feeding a large corpus into an LLM, the additional text would give it more context and help it make connections and relate words more accurately, even if you never had it output anything in the language you don't understand.

The inaudible sounds add context and extra data points on how the audible sounds relate to each other.