425 points karimf | 42 comments
1. miki123211 ◴[] No.45656279[source]
> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.

I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.

AFAIK, ChatGPT Voice Mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and assuming ethnicity / biasing based on accents.

It doesn't seem implausible to me that some of these behaviors were deliberately aligned away, out of an abundance of caution.

replies(7): >>45656408 #>>45656467 #>>45656667 #>>45657021 #>>45657291 #>>45658995 #>>45665432 #
2. sbrother ◴[] No.45656408[source]
I don't think it's just safeguards; they really don't seem to understand pitch at all. I tried asking ChatGPT's advanced voice mode to recognize a tune I was humming, and it insisted it was Beethoven's 5th -- multiple times. I think it must have basically tokenized my humming to "dun dun dun duuun".
replies(1): >>45656700 #
3. idonotknowwhy ◴[] No.45656467[source]
Qwen3 Omni's transcriber can do this. It can describe the voice and emotion very well.
replies(1): >>45657007 #
4. tsol ◴[] No.45656667[source]
Did they respond differently depending on what race they thought you were? I'm honestly surprised they would even do that. I thought they were trained on text conversations, which presumably wouldn't have any of that to learn from.
replies(4): >>45656799 #>>45656985 #>>45657478 #>>45664768 #
5. bigzyg33k ◴[] No.45656700[source]
Advanced voice mode operates on audio tokens directly; it doesn't transcribe the audio into "text tokens" as an intermediate step like the original version of voice mode did.
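For anyone unfamiliar with the distinction, here's a rough sketch of the two data flows. The model calls are stubbed out and every name is hypothetical; the point is only where the transcription step (and with it, pitch) does or doesn't appear:

```python
from typing import List

# Stubs standing in for real models (ASR, LLM, neural audio codec).
# They exist only to make the two data flows runnable for illustration.
def transcribe(waveform: List[float]) -> str:
    return "am i speaking in a low voice or a high voice"

def llm_generate(prompt):
    return prompt  # a real model would generate a reply here

def synthesize(text: str) -> List[float]:
    return [0.0] * 16000  # one second of silence at 16 kHz

def audio_codec_encode(waveform: List[float]) -> List[int]:
    return [7, 42, 42, 99]  # discrete audio tokens from a neural codec

def audio_codec_decode(tokens: List[int]) -> List[float]:
    return [0.0] * 16000

def original_voice_mode(waveform: List[float]) -> List[float]:
    # Speech -> text -> LLM -> text -> speech: pitch, accent, and
    # emotion are discarded at the transcription step.
    return synthesize(llm_generate(transcribe(waveform)))

def advanced_voice_mode(waveform: List[float]) -> List[float]:
    # One model over discrete audio tokens: pitch *can* survive,
    # if the codec encodes it and training rewards attending to it.
    return audio_codec_decode(llm_generate(audio_codec_encode(waveform)))
```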
replies(3): >>45656930 #>>45656981 #>>45656999 #
6. OisinMoran ◴[] No.45656799[source]
You can often tell where someone is from based on text alone! There are plenty of idiosyncrasies even in how different English-speaking countries use the language.
replies(2): >>45656828 #>>45657486 #
7. anotherhue ◴[] No.45656828{3}[source]
Ah stop
8. cubefox ◴[] No.45656930{3}[source]
But they behave just like models that use text tokens internally, as is also pointed out at the end of the article.
replies(1): >>45657713 #
9. sbrother ◴[] No.45656981{3}[source]
Right, but either whatever audio tokenization it's doing doesn't encode pitch, or there's ~nothing in the training set where pitch is relevant.
10. thwarted ◴[] No.45656985[source]
If it did, it responded based on the accent it picked up on, not race, because race and accent are orthogonal; correlation does not imply causation.
replies(1): >>45659653 #
11. oezi ◴[] No.45656999{3}[source]
Absolutely correct! My simple test is whether it can tell the American and British pronunciations of "tomato" and "potato" apart. So far it can't.
replies(1): >>45662122 #
12. 85392_school ◴[] No.45657007[source]
I've also had luck with Gemini. If I made a few noises and asked which one was higher pitched, it could easily tell.
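That's consistent with pitch being cheap to recover from the raw signal; it's the models, not the signal, that lose it. Here's a self-contained version of the same "which is higher?" comparison using plain NumPy autocorrelation instead of any model (the tones are synthesized for the demo):

```python
import numpy as np

def estimate_f0(signal: np.ndarray, sr: int) -> float:
    """Crude fundamental-frequency estimate via autocorrelation."""
    signal = signal - signal.mean()
    # Keep non-negative lags: index 0 of `corr` is lag 0.
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    min_lag = sr // 1000  # skip lags under 1 ms to avoid the zero-lag peak
    lag = min_lag + int(np.argmax(corr[min_lag:]))
    return sr / lag

sr = 16000
t = np.arange(sr // 2) / sr           # half a second of audio
low = np.sin(2 * np.pi * 110 * t)     # ~A2
high = np.sin(2 * np.pi * 440 * t)    # ~A4
print(estimate_f0(low, sr), estimate_f0(high, sr))   # ~110 Hz and ~444 Hz
print("second is higher:", estimate_f0(high, sr) > estimate_f0(low, sr))
```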
13. vvolhejn ◴[] No.45657291[source]
Author here. I think it's more of a capability issue than a safety issue. Since learning audio is still harder than learning text, audio models don't generalize as well. To fix that, audio models rely on combining information from text and audio (having a single model that consumes/produces both text and audio tokens), and the audio part basically ends up being an integrated speech-to-text/text-to-speech. This reflects my colleagues' experience working on Moshi, and it seems to be the case for other models too; see the Conclusion section.

Part of the reason can also be synthetic data: if you fine-tune on data generated from text via a text-to-speech system, the tone of the voice doesn't carry any information, so the model learns to ignore it.
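A toy illustration of that second point, with everything stubbed (none of these names come from a real pipeline): when every training pair is (TTS(text), text), pitch varies freely while the target stays fixed, so pitch carries zero label information and gets ignored.

```python
import random

def fake_tts(text: str, pitch_shift: float) -> dict:
    # Stand-in for a real TTS call; pitch_shift changes only the audio.
    return {"spoken_text": text, "pitch_hz": 120.0 * pitch_shift}

texts = ["hello there", "how are you", "nice weather today"]

# Synthetic fine-tuning pairs: audio in, the same text back out.
dataset = [(fake_tts(t, random.uniform(0.8, 1.5)), t) for t in texts]

for audio, target in dataset:
    # The target is identical no matter what pitch_hz is, so a model
    # minimizing prediction loss has no incentive to encode pitch.
    print(f"pitch={audio['pitch_hz']:6.1f} Hz -> target={target!r}")
```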

replies(5): >>45657465 #>>45657812 #>>45657913 #>>45661357 #>>45662630 #
14. j45 ◴[] No.45657465[source]
Accent detection, or consciously ignoring accent, is a filtering step.
15. j45 ◴[] No.45657478[source]
There are subtle differences in language: two groups can both be speaking English while one is having a completely different conversation without saying much.
replies(1): >>45659642 #
16. fragmede ◴[] No.45657486{3}[source]
Like, what do you mean? Are there, like, particular mannerisms that people from some regions have that are hella unique to those regions?
replies(5): >>45657662 #>>45657679 #>>45660771 #>>45660775 #>>45664749 #
17. robotresearcher ◴[] No.45657662{4}[source]
I say, old chap, what colour are your mummy's wellies?
18. ctxc ◴[] No.45657679{4}[source]
Clever!
19. bigzyg33k ◴[] No.45657713{4}[source]
We don't know if that's due to inherent limitations of the tokenisation of audio, or a byproduct of reinforcement learning. In my own usage, I noticed a significant degradation in capabilities over time from when they initially released advanced voice mode. The model used to be able to sing, whisper, and imitate sounds and tone just fine, but I imagine this was not intended and has subsequently been stunted via reinforcement learning.

I don't find the article's argument that this is due to tokenisation convincing.

replies(1): >>45658310 #
20. JoshTriplett ◴[] No.45657812[source]
Audio models for speech not understanding pitch seems similar to how text LLMs often don't understand spelling: it's not what they were trying to recognize.
21. oezi ◴[] No.45657913[source]
> generated from text via a text-to-speech

Yes, frustratingly we don't have good speech-to-text (STT/ASR) to transcribe such differences.

I recently fine-tuned a TTS* to be able to emit laughter, and hunting for transcriptions that include non-verbal sounds was the hardest part of it. Whisper and other popular transcription systems will ignore sighs, sniffs, laughs, etc., and can't detect mispronunciations.

* = https://github.com/coezbek/PlayDiffusion
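For anyone attempting something similar, the "hunting" step can start as a dumb filter over candidate transcripts. A minimal sketch (the tag inventory and format are my assumptions, not PlayDiffusion's):

```python
import re

# Keep only transcript lines that carry non-verbal tags, since typical
# ASR output (e.g. Whisper) drops these sounds entirely.
NONVERBAL = re.compile(r"\[(laugh(?:s|ter)?|sighs?|sniffs?|coughs?)\]", re.I)

lines = [
    "I can't believe you did that [laughs]",
    "Okay, let's get started.",
    "[sighs] Fine, have it your way.",
]

keepers = [line for line in lines if NONVERBAL.search(line)]
print(keepers)  # only the two lines with non-verbal tags survive
```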

replies(1): >>45657956 #
22. jasonjayr ◴[] No.45657956{3}[source]
IIRC -- the 15.ai dev was training on fan-made "My Little Pony" transcriptions, specifically because they included more emotive cues in the transcription and supported a syntax to control the emotive aspects of the speech.
replies(1): >>45659615 #
23. cubefox ◴[] No.45658310{5}[source]
They didn't say it's due to tokenization.

> This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.

24. bongodongobob ◴[] No.45658995[source]
Hmm, the last time I played with GPT voice mode it was able to do all kinds of different accents.
replies(1): >>45666270 #
25. dotancohen ◴[] No.45659615{4}[source]
Where can I read about this?
replies(1): >>45664506 #
26. dotancohen ◴[] No.45659642{3}[source]
This is quite the reason my wife evolved into my ex-wife.
27. dotancohen ◴[] No.45659653{3}[source]
Are you denying that race and accent are highly correlated?
replies(1): >>45663273 #
28. ◴[] No.45660771{4}[source]
29. ElevenLathe ◴[] No.45660775{4}[source]
You betcha!
30. smusamashah ◴[] No.45661357[source]
There was an example on the OpenAI blog of ChatGPT copying the speaker's voice and responding in it mid-conversation. This was presented as an example of non-alignment.
31. fragmede ◴[] No.45662122{4}[source]
Which "it" are you referring to? There are models that can.
32. wordglyph ◴[] No.45662630[source]
I used AI Studio and it understood pitch and even emotion with an uploaded MP3.
33. thwarted ◴[] No.45663273{4}[source]
No, I'm saying that it is more meaningful to use what is directly derived rather than what is an indirect assumption. There are already issues with people erroneously treating whatever LLMs output as truth; the last thing anyone needs is an LLM claiming someone like Idris Elba is a white Briton because of his accent. We don't need automated phrenology machines, and "determining your race from your voice" is pretty close to that.
34. jasonjayr ◴[] No.45664506{5}[source]
> During this phase, 15 discovered the Pony Preservation Project, a collaborative project started by /mlp/, the My Little Pony board on 4chan.[47] Contributors of the project had manually trimmed, denoised, transcribed, and emotion-tagged thousands of voice lines from My Little Pony: Friendship Is Magic and had compiled them into a dataset that provided ideal training material for 15.ai.[48]

From https://en.wikipedia.org/wiki/15.ai#2016%E2%80%932020:_Conce...

replies(1): >>45667145 #
35. xwolfi ◴[] No.45664749{4}[source]
All my Indian colleagues say "I agree with the same". This "the same" turn of phrase was so strange to me that I had to ask (I'm French, so I have my own silly quirks, like forgetting the unpronounced plural(s <-- see, often I don't write that s)). They told me it's like that in Hindi, so they just reproduce the pattern, and it's grammatically acceptable.

For French people like me, false friends are immediately noticeable: for instance, "actually" to mean "now" instead of "in fact".

replies(1): >>45677405 #
36. vessenes ◴[] No.45664768[source]
Pre-nerf, the 4o voice model had a wide range of expressivity, and it would match the affect (it still tries to do this) and idiolect of listeners if asked. Nowadays there's a list of accents that are considered "hate-ish" and a list that aren't.

I will elide the rant inside me about west coast 20-somethings getting to decide whether speaking in a certain accent is racist or "bad". But it's a heartfelt rant.

37. userbinator ◴[] No.45665432[source]
> accent matching (if you sound Indian, it shouldn't also sound Indian)

Why not? I've found that it helps greatly with mutual intelligibility when both sides are speaking a similar dialect, and the one who can do this switching switches to the dialect of the one who can't.

(I wish I could also use an Indian accent bidirectionally; would definitely come in handy for those aggravating times I've had to talk to an outsourced customer service department.)

replies(1): >>45666238 #
38. philipallstar ◴[] No.45666238[source]
This isn't a logic thing; it's a "politeness law invented by white liberal western women" thing.
replies(1): >>45670543 #
39. phrotoma ◴[] No.45666270[source]
Like others, I noticed this capability was interfered with in some way. I had fun getting it to speak to me in a cheesy, over-the-top Bostonian accent early on; then one day, when I tried to demonstrate for a friend, it interrupted itself mid-sentence, literally one voice speaking over the other truncated voice, saying something like "I'm sorry, I can't mimic voices".

It seemed like they had one model monitoring the output of another model and then cutting it off when it crossed some line.

40. vvolhejn ◴[] No.45667145{6}[source]
I had no idea this existed; the internet is amazing.
41. xp84 ◴[] No.45670543{3}[source]
Exactly this. It fits that worldview to think a literal computer was “being racist” and mocking the user, even just by copying their speech patterns accurately.
42. Xmd5a ◴[] No.45677405{5}[source]
>like forgetting the unpronounced plural(s <-- see, often I don't write that s))

That's infernal. Infernal.