Human audio perception is based on detecting frequency components, which the inner ear extracts via what amounts to a filter bank (hairs of different lengths resonating at different frequencies).
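For a rough feel of what that filter bank looks like in software, here's a minimal sketch (the filename is a placeholder, and it assumes librosa is available) that computes a mel filter bank energy per channel per frame, the usual crude stand-in for the cochlea's many resonant channels:

```python
import librosa

# Load ~3 seconds of speech at 16 kHz (path is a placeholder).
y, sr = librosa.load("speech.wav", sr=16000, duration=3.0)

# A bank of overlapping triangular filters on a mel-spaced frequency axis:
# each output channel is the energy picked up by one "resonator".
mel_spec = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=512, hop_length=160, n_mels=40
)
log_mel = librosa.power_to_db(mel_spec)
print(log_mel.shape)  # (40 mel bands, ~300 frames): one energy-per-channel vector every 10 ms
```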
Speech perception builds on this frequency representation and is based on "formants" - the frequency bands emphasized by the vocal tract resonances that articulation creates when the speech is generated. More specifically, most speech information is carried by formant changes, since these correspond to articulatory changes. There are also other articulatory artifacts in speech, such as the onsets of speech energy corresponding to plosives ("puh", "buh") and the high frequencies generated by fricatives like "sss".
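One standard way to make those resonances explicit is LPC analysis: fit an all-pole filter to the waveform and read formant estimates off its pole angles. A rough sketch, assuming librosa, a short steady-vowel clip (placeholder filename) and the usual model-order rule of thumb:

```python
import numpy as np
import librosa

# ~50 ms of a steady vowel at 16 kHz.
y, sr = librosa.load("vowel.wav", sr=16000, duration=0.05)

# Fit an all-pole (LPC) model; its resonances approximate the vocal tract's.
order = 2 + sr // 1000          # common rule of thumb: ~2 + sample rate in kHz
a = librosa.lpc(y, order=order)

# Formant estimates come from the angles of the poles (roots of A(z)).
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]            # keep one of each conjugate pair
freqs = np.angle(roots) * sr / (2 * np.pi)   # convert pole angle to Hz
freqs = freqs[freqs > 90]                    # drop near-DC poles
print(sorted(freqs)[:3])                     # rough F1, F2, F3
```

In practice you would also discard poles with very wide bandwidths, but this is enough to watch F1/F2/F3 move as articulation changes.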
One problem with embedding MP3 frames as audio tokens is that although MP3 compression is based on a frequency representation, you then have quantization, Huffman coding and the MP3 frame structure layered on top, so the frame as a whole is more of a black box. Presumably a transformer could still learn to predict a text transcription from MP3 frames, or from any arbitrary encoding of speech audio for that matter (similar to how an LLM can predict text from a Base64 representation, or vice versa), but it certainly isn't making the job easier if the input obfuscates the frequency components, formants, etc. that correspond to the generating process.
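To make the contrast concrete, here's a sketch of the two framings (filenames are placeholders, and decoding the MP3 assumes librosa has a backend that can read it): raw codec bytes as opaque tokens versus decoding back to PCM and handing the model log-mel features, which is roughly what a frontend like Whisper's does.

```python
import numpy as np
import librosa

# Option A: feed the raw compressed bitstream as opaque byte tokens.
with open("speech.mp3", "rb") as f:
    mp3_bytes = np.frombuffer(f.read(), dtype=np.uint8)
# Quantization, Huffman coding and frame headers are all entangled here;
# the model would have to learn to undo the codec before "seeing" frequencies.

# Option B: decode to PCM first and expose the spectrum explicitly.
y, sr = librosa.load("speech.mp3", sr=16000)   # decoding depends on the installed backend
log_mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
)
# log_mel lays out energy per frequency band per 10 ms frame, so formant
# structure and plosive onsets are directly inspectable features.
```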
Not having direct access to the frequency/formant information is also going to make generalization harder, since generalization in speech is based around formant structure and formant changes. When articulating the same word, the specific formant frequencies differ between individuals, primarily due to vocal tract length, but humans have no problem generalizing across these differences and understanding speech from different speakers. I'm not sure an LLM trained only to predict MP3 speech from, say, adult males would necessarily have generalized enough to also recognize child speech, or speech from a synthesizer.
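The scale of those individual differences is easy to ballpark with the uniform-tube (quarter-wave resonator) approximation of the vocal tract, F_n = (2n - 1) * c / (4L); the tract lengths below are just typical illustrative figures:

```python
# Uniform-tube approximation: formants of a tube closed at one end.
C = 34300.0  # speed of sound in cm/s

def formants(tract_length_cm, n=3):
    """First n resonant frequencies (Hz) of a tube of the given length."""
    return [(2 * k - 1) * C / (4 * tract_length_cm) for k in range(1, n + 1)]

print(formants(17.5))  # adult male,  ~17.5 cm -> ~[490, 1470, 2450] Hz
print(formants(14.5))  # adult female, ~14.5 cm -> ~[590, 1780, 2960] Hz
print(formants(10.0))  # young child,  ~10 cm   -> ~[860, 2570, 4290] Hz
```

Same articulation, yet the child's formants sit hundreds of hertz to over a kilohertz higher than the adult male's, which is exactly the kind of systematic shift a model buried under codec framing would have to discover on its own.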