425 points karimf | 42 comments
1. miki123211 ◴[] No.45656279[source]
> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.

I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.

AFAIK, ChatGPT Voice Mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and assuming ethnicity / biasing based on accents.

It doesn't seem implausible to me that some of these behaviors were deliberately aligned away, out of an abundance of caution.

replies(7): >>45656408 #>>45656467 #>>45656667 #>>45657021 #>>45657291 #>>45658995 #>>45665432 #
2. sbrother ◴[] No.45656408[source]
I don't think it's just safeguards; they really don't seem to understand pitch at all. I tried asking ChatGPT's advanced voice mode to recognize a tune I was humming, and it insisted it was Beethoven's 5th -- multiple times. I think it must have basically tokenized my humming to "dun dun dun duuun".
replies(1): >>45656700 #
3. idonotknowwhy ◴[] No.45656467[source]
Qwen3 Omni's transcriber can do this. It can describe the voice and emotion very well.
replies(1): >>45657007 #
4. tsol ◴[] No.45656667[source]
Did they respond differently depending on what race they thought you were? I'm honestly surprised they would even do that. I thought they were trained on text conversations, which presumably wouldn't have any of that to learn from.
replies(4): >>45656799 #>>45656985 #>>45657478 #>>45664768 #
5. bigzyg33k ◴[] No.45656700[source]
Advanced voice mode operates on audio tokens directly; it doesn't transcribe the audio into "text tokens" as an intermediate step like the original version of voice mode did.
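For anyone unfamiliar with the distinction, here's a rough sketch of the two data flows. The model calls are stubbed out and every name is hypothetical; the point is only where the transcription step (and with it, pitch) does or doesn't appear:

```python
from typing import List

# Stubs standing in for real models (ASR, LLM, neural audio codec).
# They exist only to make the two data flows runnable for illustration.
def transcribe(waveform: List[float]) -> str:
    return "am i speaking in a low voice or a high voice"

def llm_generate(prompt):
    return prompt  # a real model would generate a reply here

def synthesize(text: str) -> List[float]:
    return [0.0] * 16000  # one second of silence at 16 kHz

def audio_codec_encode(waveform: List[float]) -> List[int]:
    return [7, 42, 42, 99]  # discrete audio tokens from a neural codec

def audio_codec_decode(tokens: List[int]) -> List[float]:
    return [0.0] * 16000

def original_voice_mode(waveform: List[float]) -> List[float]:
    # Speech -> text -> LLM -> text -> speech: pitch, accent, and
    # emotion are discarded at the transcription step.
    return synthesize(llm_generate(transcribe(waveform)))

def advanced_voice_mode(waveform: List[float]) -> List[float]:
    # One model over discrete audio tokens: pitch *can* survive,
    # if the codec encodes it and training rewards attending to it.
    return audio_codec_decode(llm_generate(audio_codec_encode(waveform)))
```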
replies(3): >>45656930 #>>45656981 #>>45656999 #
6. OisinMoran ◴[] No.45656799[source]
You can often tell where someone is from based on text alone! There are plenty of idiosyncrasies even in how different English-speaking countries use the language.
replies(2): >>45656828 #>>45657486 #
7. anotherhue ◴[] No.45656828{3}[source]
Ah stop
8. cubefox ◴[] No.45656930{3}[source]
But they behave just like models that use text tokens internally, as is also pointed out at the end of the article.
replies(1): >>45657713 #
9. sbrother ◴[] No.45656981{3}[source]
Right, but either whatever audio tokenization it's doing doesn't encode pitch, or there's ~nothing in the training set where pitch is relevant.
10. thwarted ◴[] No.45656985[source]
If it did, it responded based on the accent it picked up on, not race, because race and accent are orthogonal; correlation does not imply causation.
replies(1): >>45659653 #
11. oezi ◴[] No.45656999{3}[source]
Absolutely correct! My simple test is whether it can tell the American and British pronunciations of "tomato" and "potato" apart. So far it can't.
replies(1): >>45662122 #
12. 85392_school ◴[] No.45657007[source]
I've also had luck with Gemini. If I made a few noises and asked which one was higher pitched, it could easily tell.
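That's consistent with pitch being cheap to recover from the raw signal; it's the models, not the signal, that lose it. Here's a self-contained version of the same "which is higher?" comparison using plain NumPy autocorrelation instead of any model (the tones are synthesized for the demo):

```python
import numpy as np

def estimate_f0(signal: np.ndarray, sr: int) -> float:
    """Crude fundamental-frequency estimate via autocorrelation."""
    signal = signal - signal.mean()
    # Keep non-negative lags: index 0 of `corr` is lag 0.
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    min_lag = sr // 1000  # skip lags under 1 ms to avoid the zero-lag peak
    lag = min_lag + int(np.argmax(corr[min_lag:]))
    return sr / lag

sr = 16000
t = np.arange(sr // 2) / sr           # half a second of audio
low = np.sin(2 * np.pi * 110 * t)     # ~A2
high = np.sin(2 * np.pi * 440 * t)    # ~A4
print(estimate_f0(low, sr), estimate_f0(high, sr))   # ~110 Hz and ~444 Hz
print("second is higher:", estimate_f0(high, sr) > estimate_f0(low, sr))
```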
13. vvolhejn ◴[] No.45657291[source]
Author here. I think it's more of a capability issue than a safety issue. Since learning audio is still harder than learning text, audio models don't generalize as well. To fix that, audio models rely on combining information from text and audio (having a single model that consumes/produces both text and audio tokens), and the audio part basically ends up being an integrated speech-to-text/text-to-speech. This reflects my colleagues' experience working on Moshi, and it seems to be the case for other models too; see the Conclusion section.

Part of the reason can also be synthetic data: if you fine-tune on data generated from text via a text-to-speech system, the tone of the voice doesn't carry any information, so the model learns to ignore it.
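A toy illustration of that second point, with everything stubbed (none of these names come from a real pipeline): when every training pair is (TTS(text), text), pitch varies freely while the target stays fixed, so pitch carries zero label information and gets ignored.

```python
import random

def fake_tts(text: str, pitch_shift: float) -> dict:
    # Stand-in for a real TTS call; pitch_shift changes only the audio.
    return {"spoken_text": text, "pitch_hz": 120.0 * pitch_shift}

texts = ["hello there", "how are you", "nice weather today"]

# Synthetic fine-tuning pairs: audio in, the same text back out.
dataset = [(fake_tts(t, random.uniform(0.8, 1.5)), t) for t in texts]

for audio, target in dataset:
    # The target is identical no matter what pitch_hz is, so a model
    # minimizing prediction loss has no incentive to encode pitch.
    print(f"pitch={audio['pitch_hz']:6.1f} Hz -> target={target!r}")
```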

replies(5): >>45657465 #>>45657812 #>>45657913 #>>45661357 #>>45662630 #
14. j45 ◴[] No.45657465[source]
Accent detection, or consciously ignoring accent, is a filtering step.
15. j45 ◴[] No.45657478[source]
There are subtle differences in language: two groups can both be speaking English while one is having a completely different conversation without saying much.
replies(1): >>45659642 #
16. fragmede ◴[] No.45657486{3}[source]
Like, what do you mean? Are there, like, particular mannerisms that people from some regions have that are hella unique to those regions?
replies(5): >>45657662 #>>45657679 #>>45660771 #>>45660775 #>>45664749 #
17. robotresearcher ◴[] No.45657662{4}[source]
I say, old chap, what colour are your mummy's wellies?
18. ctxc ◴[] No.45657679{4}[source]
Clever!
19. bigzyg33k ◴[] No.45657713{4}[source]
We don't know if that's due to inherent limitations of the tokenisation of audio, or a byproduct of reinforcement learning. In my own usage, I noticed a significant degradation in capabilities over time from when they initially released advanced voice mode. The model used to be able to sing, whisper, and imitate sounds and tone just fine, but I imagine this was not intended and has subsequently been stunted via reinforcement learning.

I don't find the article's argument that this is due to tokenisation convincing.

replies(1): >>45658310 #
20. JoshTriplett ◴[] No.45657812[source]
Audio models for speech not understanding pitch seems similar to how text LLMs often don't understand spelling: it's not what they were trying to recognize.
21. oezi ◴[] No.45657913[source]
> generated from text via a text-to-speech

Yes, frustratingly we don't have good speech-to-text (STT/ASR) to transcribe such differences.

I recently fine-tuned a TTS* to be able to emit laughter, and hunting for transcriptions that include non-verbal sounds was the hardest part of it. Whisper and other popular transcription systems will ignore sighs, sniffs, laughs, etc., and can't detect mispronunciations.

* = https://github.com/coezbek/PlayDiffusion
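For anyone attempting something similar, the "hunting" step can start as a dumb filter over candidate transcripts. A minimal sketch (the tag inventory and format are my assumptions, not PlayDiffusion's):

```python
import re

# Keep only transcript lines that carry non-verbal tags, since typical
# ASR output (e.g. Whisper) drops these sounds entirely.
NONVERBAL = re.compile(r"\[(laugh(?:s|ter)?|sighs?|sniffs?|coughs?)\]", re.I)

lines = [
    "I can't believe you did that [laughs]",
    "Okay, let's get started.",
    "[sighs] Fine, have it your way.",
]

keepers = [line for line in lines if NONVERBAL.search(line)]
print(keepers)  # only the two lines with non-verbal tags survive
```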

replies(1): >>45657956 #
22. jasonjayr ◴[] No.45657956{3}[source]
IIRC -- the 15.ai dev was training on fan-made "My Little Pony" transcriptions, specifically because they included more emotive cues in the transcription and supported a syntax to control the emotive aspects of the speech.
replies(1): >>45659615 #
23. cubefox ◴[] No.45658310{5}[source]
They didn't say it's due to tokenization.

> This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.

24. bongodongobob ◴[] No.45658995[source]
Hmm, the last time I played with GPT voice mode it was able to do all kinds of different accents.
replies(1): >>45666270 #
25. dotancohen ◴[] No.45659615{4}[source]
Where can I read about this?
replies(1): >>45664506 #
26. dotancohen ◴[] No.45659642{3}[source]
This is quite the reason my wife evolved into my ex-wife.
27. dotancohen ◴[] No.45659653{3}[source]
Are you denying that race and accent are highly correlated?
replies(1): >>45663273 #
28. ◴[] No.45660771{4}[source]
29. ElevenLathe ◴[] No.45660775{4}[source]
You betcha!
30. smusamashah ◴[] No.45661357[source]
There was an example on the OpenAI blog of ChatGPT copying the speaker's voice and responding in it mid-conversation. This was presented as an example of non-alignment.
31. fragmede ◴[] No.45662122{4}[source]
Which "it" are you referring to? There are models that can.
32. wordglyph ◴[] No.45662630[source]
I used AI Studio and it understood pitch and even emotion with an uploaded MP3.
33. thwarted ◴[] No.45663273{4}[source]
No, I'm saying that it is more meaningful to use what is directly derived rather than what is an indirect assumption. There are already issues with people erroneously treating whatever LLMs output as truth; the last thing anyone needs is an LLM claiming someone like Idris Elba is a white Briton because of his accent. We don't need automated phrenology machines, and "determining your race from your voice" is pretty close to that.
34. jasonjayr ◴[] No.45664506{5}[source]
> During this phase, 15 discovered the Pony Preservation Project, a collaborative project started by /mlp/, the My Little Pony board on 4chan.[47] Contributors of the project had manually trimmed, denoised, transcribed, and emotion-tagged thousands of voice lines from My Little Pony: Friendship Is Magic and had compiled them into a dataset that provided ideal training material for 15.ai.[48]

From https://en.wikipedia.org/wiki/15.ai#2016%E2%80%932020:_Conce...

replies(1): >>45667145 #
35. xwolfi ◴[] No.45664749{4}[source]
All my Indian colleagues say "I agree with the same". This "the same" turn of phrase was so strange to me that I had to ask (I'm French, so I have my own silly quirks, like forgetting the unpronounced plural(s <-- see, often I don't write that s)). They told me it's like that in Hindi, so they just reproduce the pattern, and it's grammatically acceptable.

For French people like me, false friends are immediately noticeable: for instance, "actually" to mean "now" instead of "in fact".

replies(1): >>45677405 #
36. vessenes ◴[] No.45664768[source]
Pre-nerf, the 4o voice model had a wide range of expressivity, and it would match the affect (it still tries to do this) and idiolect of listeners if asked. Nowadays there's a list of accents that are considered "hate-ish" and a list that aren't.

I will elide the rant inside me about west coast 20-somethings getting to decide whether speaking in a certain accent is racist or "bad". But it's a heartfelt rant.

37. userbinator ◴[] No.45665432[source]
> accent matching (if you sound Indian, it shouldn't also sound Indian)

Why not? I've found that it helps greatly with mutual intelligibility when both sides are speaking a similar dialect, and the one who can do this switching switches to the dialect of the one who can't.

(I wish I could also use an Indian accent bidirectionally; would definitely come in handy for those aggravating times I've had to talk to an outsourced customer service department.)

replies(1): >>45666238 #
38. philipallstar ◴[] No.45666238[source]
This isn't a logic thing; it's a "politeness law invented by white liberal western women" thing.
replies(1): >>45670543 #
39. phrotoma ◴[] No.45666270[source]
Like others, I noticed this capability was interfered with in some way. I had fun getting it to speak to me in a cheesy, over-the-top Bostonian accent early on; then one day, when I tried to demonstrate for a friend, it interrupted itself mid-sentence, literally one voice speaking over the other truncated voice, saying something like "I'm sorry, I can't mimic voices".

It seemed like they had one model monitoring the output of another model and then cutting it off when it crossed some line.

40. vvolhejn ◴[] No.45667145{6}[source]
I had no idea this existed; the internet is amazing.
41. xp84 ◴[] No.45670543{3}[source]
Exactly this. It fits that worldview to think a literal computer was “being racist” and mocking the user, even just by copying their speech patterns accurately.
42. Xmd5a ◴[] No.45677405{5}[source]
>like forgetting the unpronounced plural(s <-- see, often I don't write that s))

That's infernal. Infernal.