
425 points by karimf | 7 comments
miki123211 ◴[] No.45656279[source]
> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.

I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.

AFAIK, ChatGPT's voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and making assumptions about ethnicity or biasing based on accent.

It doesn't seem that impossible to me that some of these behaviors have been aligned out of these models out of an abundance of caution.

replies(7): >>45656408 #>>45656467 #>>45656667 #>>45657021 #>>45657291 #>>45658995 #>>45665432 #
sbrother ◴[] No.45656408[source]
I don't think it's just safeguards; they really don't seem to understand pitch at all. I tried asking ChatGPT's advanced voice mode to recognize a tune I was humming, and it insisted it was Beethoven's 5th -- multiple times. I think it must have basically tokenized my humming to "dun dun dun duuun".
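
A toy sketch of that failure mode (the note values below are invented, and this says nothing about ChatGPT's actual internals): once a hum is reduced to rhythm alone, any short-short-short-long phrase is indistinguishable from the opening of the 5th, whatever notes were actually sung.

    # Invented example: pitches are MIDI note numbers, durations are in beats.
    hummed        = [(60, 0.5), (62, 0.5), (64, 0.5), (65, 2.0)]  # an ascending line, C-D-E-F
    beethoven_5th = [(67, 0.5), (67, 0.5), (67, 0.5), (63, 2.0)]  # the opening motif, G-G-G-Eb

    def rhythm_only(notes):
        # Keep the durations, drop the pitches -- roughly "dun dun dun duuun".
        return [duration for _, duration in notes]

    print(rhythm_only(hummed) == rhythm_only(beethoven_5th))        # True: same rhythm
    print([p for p, _ in hummed] == [p for p, _ in beethoven_5th])  # False: different melody
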
replies(1): >>45656700 #
1. bigzyg33k ◴[] No.45656700{3}[source]
advanced voice mode operates on audio tokens directly; it doesn't transcribe the audio into text tokens as an intermediate step the way the original version of voice mode did.
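
A rough sketch of the two pipelines (every name below is a hypothetical stub, not a real OpenAI API or component), just to show where prosody can get dropped in the cascaded version:

    # Hypothetical stubs only -- none of these are real OpenAI components.

    def asr_transcribe(waveform: list[float]) -> str:
        # Speech-to-text front end: pitch, timbre and accent are discarded here.
        return "am i speaking in a low voice or a high voice"

    def text_llm(prompt: str) -> str:
        # Stand-in for a text-only LLM.
        return "I can't hear audio, so I can't tell."

    def audio_tokenize(waveform: list[float]) -> list[int]:
        # Stand-in for a neural-codec tokenizer; whether pitch survives depends
        # on the codec and on whether training ever rewards attending to it.
        return [812, 4401, 97]

    def audio_llm(tokens: list[int]) -> str:
        # Stand-in for a model trained directly on audio tokens.
        return f"received {len(tokens)} audio tokens"

    # Original voice mode (cascaded): audio -> text -> LLM -> TTS.
    def cascaded(waveform: list[float]) -> str:
        return text_llm(asr_transcribe(waveform))  # prosody is already gone here

    # Advanced voice mode (end-to-end): audio tokens in, audio tokens out.
    def end_to_end(waveform: list[float]) -> str:
        return audio_llm(audio_tokenize(waveform))  # pitch is available only if
                                                    # the tokenizer encodes it

    print(cascaded([0.1, -0.1, 0.2]))
    print(end_to_end([0.1, -0.1, 0.2]))
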
replies(3): >>45656930 #>>45656981 #>>45656999 #
2. cubefox ◴[] No.45656930[source]
But they behave just like models which use text tokens internally, which is also pointed out at the end of the above article.
replies(1): >>45657713 #
3. sbrother ◴[] No.45656981[source]
right, but then either whatever audio tokenization it's doing doesn't encode pitch, or there's ~nothing in the training set where pitch is relevant.
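
For what it's worth, pitch is trivially recoverable from the raw waveform. A minimal numpy sketch (a crude autocorrelation f0 estimate on synthetic tones, not anything OpenAI is known to use) shows how little it takes to answer "high voice or low voice":

    import numpy as np

    def estimate_f0(signal: np.ndarray, sr: int, fmin: float = 80.0, fmax: float = 400.0) -> float:
        # Crude autocorrelation-based fundamental-frequency estimate.
        sig = signal - signal.mean()
        corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        lag = lo + np.argmax(corr[lo:hi])
        return sr / lag

    sr = 16000
    t = np.arange(sr) / sr
    low_voice  = np.sin(2 * np.pi * 100 * t)  # 100 Hz: low-pitched voice territory
    high_voice = np.sin(2 * np.pi * 250 * t)  # 250 Hz: high-pitched voice territory

    print(estimate_f0(low_voice, sr), "Hz")   # ~100 Hz
    print(estimate_f0(high_voice, sr), "Hz")  # ~250 Hz

Whether a neural audio codec keeps that information, and whether next-token training ever rewards using it, is a separate question.
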
4. oezi ◴[] No.45656999[source]
Absolutely correct! My simple test is whether it can tell the American and British English pronunciations of "tomato" and "potato" apart. So far it can't.
replies(1): >>45662122 #
5. bigzyg33k ◴[] No.45657713[source]
we don't know whether that's due to inherent limitations of the audio tokenisation or a byproduct of reinforcement learning. In my own usage, I noticed a significant degradation in capabilities over time from when they initially released advanced voice mode. The model used to be able to sing, whisper, and imitate sounds and tone just fine, but I imagine this was not intended and has subsequently been stunted via reinforcement learning.

I don't find the article's argument that this is due to tokenisation convincing.

replies(1): >>45658310 #
6. cubefox ◴[] No.45658310{3}[source]
They didn't say it's due to tokenization.

> This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.

7. fragmede ◴[] No.45662122[source]
Which "it" are you referring to? There are models that can.