
425 points by karimf | 4 comments
bob1029 | No.45656123
Why not normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to go about tokenizing spatial and time domain signals for these kinds of models?

Each MP3 frame is entirely self-contained and can completely reconstruct a few tens of milliseconds of original audio. It does not require other frames to do this. I think this is the most important element. At 128kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of time. This is a reduction of 10-11x over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.
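
Quick back-of-the-envelope check on those numbers (a sketch assuming MPEG-1 Layer III at 44.1 kHz, 128kbps CBR, 16-bit stereo PCM):

    SAMPLE_RATE = 44_100        # Hz
    BITRATE = 128_000           # bits per second (CBR)
    SAMPLES_PER_FRAME = 1152    # fixed for MPEG-1 Layer III

    frame_bytes = 144 * BITRATE // SAMPLE_RATE          # ~417 bytes (418 with padding)
    frame_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE   # ~26.1 ms
    pcm_bytes = SAMPLES_PER_FRAME * 2 * 2               # 16 bits * 2 channels = 4608 bytes

    print(frame_bytes, round(frame_ms, 1), round(pcm_bytes / frame_bytes, 1))
    # -> 417 26.1 11.1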

I don't know if it's possible to use ~400-byte tokens in a transformer model, but I'd be very tempted to try.

replies(6): >>45656175 #>>45656782 #>>45656867 #>>45657008 #>>45657386 #>>45657808 #
vvolhejn | No.45657386
Author here. There are a few reasons, but the biggest one is simply the compression ratio.

The OG neural audio codec SoundStream (whose first author is Neil, now at Kyutai) can sound decent at 3kbps, whereas MP3 typically runs at around 128kbps, as you say. Interestingly, SoundStream was originally developed for audio compression in Google Meet, not for LLMs. Today's neural codecs compress even better.

The more modern MP3 alternative is Opus, which can work ok at 12kbps, but it's still less efficient than neural audio codecs. However, these traditional codecs are a lot less CPU-hungry, so they have that going for them.
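
For a sense of where these bitrates come from: an RVQ-based codec's bitrate is just frame rate x number of codebooks x bits per codebook. A rough sketch (the frame rate and codebook size here are only illustrative, roughly in the ballpark of SoundStream's 24kHz setup):

    import math

    # Bitrate of an RVQ codec = frame_rate * n_codebooks * log2(codebook_size).
    frame_rate = 75          # frames/s, e.g. 24 kHz audio with a stride of 320 (illustrative)
    codebook_size = 1024     # entries per RVQ codebook -> 10 bits each
    for n_codebooks in (2, 4, 8):
        kbps = frame_rate * n_codebooks * math.log2(codebook_size) / 1000
        print(f"{n_codebooks} codebooks -> {kbps} kbps")   # 1.5, 3.0, 6.0 kbps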

replies(1): >>45659976 #
1. espadrine | No.45659976
That makes sense.

Why RVQ though, rather than using the raw VAE embedding?

If I compare rvq-without-quantization-v4.png with rvq-2-level-v4.png, the quality seems oddly similar, but the former takes a 32-sized vector, while the latter takes two 32-sized (one-hot) vectors (2 = number of levels, 32 = number of quantization cluster centers). Isn't that more?

replies(1): >>45660656 #
2. vvolhejn | No.45660656
I had a part about this but I took it out: for compression, you could keep the embeddings unquantized and it would still compress quite well, depending on the embedding dimension and the number of quantization levels.
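
To put rough numbers on your example (assuming float32/float16 storage for the unquantized case):

    import math

    # Bits per frame, using the dimensions from the question above:
    # a 32-dim embedding vs. 2 RVQ levels with 32 cluster centers each.
    embedding_dim = 32
    n_levels, codebook_size = 2, 32

    unquantized_fp32 = embedding_dim * 32              # 1024 bits
    unquantized_fp16 = embedding_dim * 16              # 512 bits
    rvq_bits = n_levels * math.log2(codebook_size)     # 2 * 5 = 10 bits

    print(unquantized_fp32, unquantized_fp16, rvq_bits)
    # Each "one-hot vector" is stored as a 5-bit index, not as 32 numbers.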

But categorical distributions are better for modelling. It's a little difficult to explain here without diagrams, but the intuition is this: if you train a model to regress the next embedding directly (say with an L2 loss) instead of predicting the next token, it can't represent multimodal distributions. It will end up predicting the mean of the possible continuations rather than any of the modes, which is not what you want.
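
Here's a toy illustration of that mean-vs-mode effect (made-up bimodal data, nothing to do with the actual codec):

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy bimodal continuation: the next value is -1 or +1, never anything in between.
    targets = rng.choice([-1.0, 1.0], size=10_000)

    # Regressing the value directly with an MSE loss converges to the mean...
    mse_optimal = targets.mean()        # ~0.0, a continuation that never actually occurs

    # ...while a categorical model over {-1, +1} keeps both modes and can sample either.
    p_plus = (targets == 1.0).mean()    # ~0.5
    print(round(float(mse_optimal), 3), round(float(p_plus), 3))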

Check out Section 5.3 and Figure 6 from PixelRNN, where they discuss this phenomenon: https://arxiv.org/pdf/1601.06759

At the bottom of the blog, I link two articles that do make continuous embeddings work. One of them is the Kyutai paper Continuous Audio Language Models: https://arxiv.org/abs/2509.06926

replies(1): >>45663955 #
3. programjames | No.45663955
Hmm, I think a mixture of beta distributions could work just as well as categorical here. I'm going to train it for PixelRNN, but it's going to take hours or days (it's a very inefficient and unparallelizable architecture). I'll report back tomorrow.
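
Roughly what I have in mind for the output head, as a sketch (PyTorch; the names, component count, and the (0, 1) pixel rescaling are my own assumptions):

    import torch
    from torch import distributions as D

    def mixture_of_betas_nll(params, pixels, k=5, eps=1e-4):
        """NLL of pixels in (0, 1) under a k-component Beta mixture.

        params: (..., 3*k) raw network outputs -> mixture logits, alpha, beta.
        pixels: (...) pixel intensities rescaled to (0, 1).
        """
        logits, a_raw, b_raw = params.split(k, dim=-1)
        alpha = torch.nn.functional.softplus(a_raw) + eps   # keep concentrations positive
        beta = torch.nn.functional.softplus(b_raw) + eps
        mix = D.MixtureSameFamily(D.Categorical(logits=logits), D.Beta(alpha, beta))
        x = pixels.clamp(eps, 1 - eps)                       # Beta support is open (0, 1)
        return -mix.log_prob(x)

    # e.g. a head emitting 15 numbers per pixel for a 5-component mixture:
    params = torch.randn(8, 32, 32, 15)
    pixels = torch.rand(8, 32, 32)
    loss = mixture_of_betas_nll(params, pixels).mean()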
replies(1): >>45677175 #
4. programjames | No.45677175
Update 1: After ~12 hours of training and 45 epochs on CIFAR, I'm starting to see textures.

https://imgur.com/MzKUKhH