
425 points | karimf | 1 comment | source
bob1029 ◴[] No.45656123[source]
Why not normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to go about tokenizing spatial and time domain signals for these kinds of models?

Each MP3 frame is self-contained and can reconstruct a few tens of milliseconds of the original audio without needing other frames; I think this is the most important element. At 128kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of audio. This is a reduction of 10-11x over the raw PCM waveform. MP3 is also designed to discard information that human listeners mostly can't perceive.

I don't know if it's possible to use ~400-byte tokens in a transformer model, but I would be very tempted to try.
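
A rough sketch of the frame math, plus one untested way such a frame could be turned into a token embedding; the projection and model width below are made up purely for illustration:

```python
# Back-of-the-envelope check of the numbers above, and a toy sketch of
# feeding one MP3 frame to a transformer as a single token. The projection
# and model width are hypothetical, not from any existing system.
import numpy as np

BITRATE_BPS   = 128_000   # 128 kbps CBR
SAMPLE_RATE   = 44_100    # Hz, CD audio
FRAME_SAMPLES = 1_152     # samples per MPEG-1 Layer III frame

frame_seconds = FRAME_SAMPLES / SAMPLE_RATE              # ~0.026 s
frame_bytes   = round(BITRATE_BPS * frame_seconds / 8)   # ~418 bytes
pcm_bytes     = FRAME_SAMPLES * 2 * 2                    # 16-bit stereo PCM for the same span
print(f"{frame_seconds*1000:.1f} ms, {frame_bytes} B, {pcm_bytes / frame_bytes:.1f}x smaller")

# Hypothetical: treat the frame's bytes as one input vector and learn a
# linear projection into the model's embedding space.
d_model = 512                                             # assumed transformer width
proj    = np.random.randn(frame_bytes, d_model) * 0.02    # stand-in for learned weights
frame   = np.random.randint(0, 256, frame_bytes)          # stand-in for real frame bytes
token   = (frame / 255.0) @ proj                          # one token embedding, shape (d_model,)
```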

replies(6): >>45656175 #>>45656782 #>>45656867 #>>45657008 #>>45657386 #>>45657808 #
PaulDavisThe1st ◴[] No.45656175[source]
The approach in TFA encodes into a 32-dimensional space. I suspect that is significantly more dimensions than any psychoacoustic compression algorithm uses. Also, throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or, more generally, audio) synthesis from scratch.
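
For concreteness, a toy picture of what a per-frame 32-dimensional latent looks like; the weights here are random stand-ins, not the actual encoder from TFA:

```python
# Toy illustration only: a neural codec's encoder maps each short audio
# frame to one continuous 32-dim latent vector. The linear map below is a
# random stand-in for a learned encoder.
import numpy as np

LATENT_DIM = 32
FRAME_LEN  = 1_152                                         # samples per frame, matching the MP3 figure above

weights = np.random.randn(FRAME_LEN, LATENT_DIM) * 0.01   # stand-in for learned encoder weights
frame   = np.random.randn(FRAME_LEN)                       # stand-in for real PCM samples
latent  = frame @ weights                                  # shape: (32,)
print(latent.shape)
```
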
replies(1): >>45656394 #
bob1029 ◴[] No.45656394[source]
> throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or more generally, audio) synthesis from scratch.

I'm not sure I follow. If there is a set of tokens that the average human cannot perceive, why wouldn't we want to eliminate them from the search space? Who is the target audience for this model?

replies(3): >>45656507 #>>45656712 #>>45656731 #
1. CaptainOfCoit ◴[] No.45656507[source]
Maybe because things outside our audible range can influence what we perceive inside of it?