
425 points karimf | 10 comments
miki123211 ◴[] No.45656279[source]
> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.

I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.

AFAIK, ChatGPT Voice Mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and assuming ethnicity or biasing based on accents.

It doesn't seem that impossible to me that some of these behaviors have been aligned out of these models out of an abundance of caution.

replies(7): >>45656408 #>>45656467 #>>45656667 #>>45657021 #>>45657291 #>>45658995 #>>45665432 #
1. vvolhejn ◴[] No.45657291[source]
Author here. I think it's more of a capability issue than a safety issue. Since learning audio is still harder than learning text, audio models don't generalize as well. To work around that, audio models rely on combining information from text and audio (having a single model that consumes/produces both text and audio tokens), and the audio pathway basically ends up being an integrated speech-to-text/text-to-speech. This reflects my colleagues' experience working on Moshi, and it seems to be the case for other models too; see the Conclusion section.
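
To make the "single model over both token streams" idea concrete, here is a minimal, hypothetical sketch in PyTorch. It is not Moshi's actual architecture; the vocabulary sizes, depth, and the naive concatenation of the two streams are placeholder choices, just to show text and audio tokens sharing one backbone.

    # Minimal sketch (not Moshi's real architecture): text tokens and
    # audio-codec tokens go through separate embedding tables but share
    # one transformer backbone, so text supervision can help the audio side.
    import torch
    import torch.nn as nn

    TEXT_VOCAB, AUDIO_VOCAB, D = 32_000, 2_048, 512  # placeholder sizes

    class TextAudioLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.text_emb = nn.Embedding(TEXT_VOCAB, D)
            self.audio_emb = nn.Embedding(AUDIO_VOCAB, D)
            layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=4)
            self.text_head = nn.Linear(D, TEXT_VOCAB)
            self.audio_head = nn.Linear(D, AUDIO_VOCAB)

        def forward(self, text_ids, audio_ids):
            # Combine the two streams (here: simple concatenation) and run
            # them through the same backbone.
            x = torch.cat([self.text_emb(text_ids), self.audio_emb(audio_ids)], dim=1)
            h = self.backbone(x)
            n_text = text_ids.shape[1]
            return self.text_head(h[:, :n_text]), self.audio_head(h[:, n_text:])

    model = TextAudioLM()
    text = torch.randint(0, TEXT_VOCAB, (1, 16))    # ids from a text tokenizer
    audio = torch.randint(0, AUDIO_VOCAB, (1, 64))  # ids from a neural audio codec
    text_logits, audio_logits = model(text, audio)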

Part of the reason can also be synthetic data: if you fine-tune on data generated from text via a text-to-speech system, the tone of the voice doesn't carry any information, so the model learns to ignore it.
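
A toy way to see why: if the labels in a synthetic dataset are derived from text alone, any "how it was said" feature is pure noise with respect to the target, and a model fit on that data learns a near-zero weight for it. The features and labels below are entirely made up for illustration.

    # Toy illustration of the synthetic-data problem: targets come from text
    # only, so the pitch feature carries no signal and gets ignored.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 10_000
    text_feature = rng.integers(0, 2, n)   # stands in for "what was said"
    pitch = rng.normal(size=n)             # stands in for "how it was said"
    label = text_feature                   # synthetic labels generated from text alone

    X = np.column_stack([text_feature, pitch])
    clf = LogisticRegression().fit(X, label)
    print(clf.coef_)  # the pitch coefficient comes out near zero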

replies(5): >>45657465 #>>45657812 #>>45657913 #>>45661357 #>>45662630 #
2. j45 ◴[] No.45657465[source]
Accent detection, or consciously ignoring accent, is a filter step.
3. JoshTriplett ◴[] No.45657812[source]
Speech audio models not understanding pitch seems similar to how text LLMs often don't understand spelling: it's not what they were trained to recognize.
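
A rough text-domain analogy, assuming the tiktoken package (the encoding name is just one common choice): a BPE tokenizer hands the model subword ids rather than characters, so spelling isn't directly visible, much as pitch isn't directly visible once audio has been squeezed into text-like tokens.

    # Subword tokenization hides characters from the model.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                              # a handful of subword ids
    print([enc.decode([i]) for i in ids])   # chunks, not individual letters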
4. oezi ◴[] No.45657913[source]
> generated from text via a text-to-speech

Yes, frustratingly, we don't have good speech-to-text (STT/ASR) systems that transcribe such differences.

I recently fine-tuned a TTS model* to be able to emit laughter, and hunting for transcriptions that include non-verbal sounds was the hardest part of it. Whisper and other popular transcription systems will ignore sighs, sniffs, laughs, etc., and can't detect mispronunciations either.

* = https://github.com/coezbek/PlayDiffusion
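
For the "hunting for transcriptions" step, something along these lines is roughly what's involved; the tag list and the one-line-per-utterance file format are assumptions, not what PlayDiffusion actually uses.

    # Hypothetical helper: keep only transcript lines that contain bracketed
    # non-verbal tags such as [laughs] or (sighs).
    import re

    NONVERBAL = re.compile(
        r"[\[(](laugh(s|ter)?|sighs?|sniffs?|coughs?)[\])]", re.IGNORECASE)

    def lines_with_nonverbal(path: str):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if NONVERBAL.search(line):
                    yield line.rstrip("\n")

    # Usage: print(list(lines_with_nonverbal("transcripts.txt")))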

replies(1): >>45657956 #
5. jasonjayr ◴[] No.45657956[source]
IIRC -- the 15.ai dev was training on fan-made "My Little Pony" transcriptions, specifically because they included more emotive clues in the transcription and supported a syntax to control the emotive aspect of the speech.
replies(1): >>45659615 #
6. dotancohen ◴[] No.45659615{3}[source]
Where can I read about this?
replies(1): >>45664506 #
7. smusamashah ◴[] No.45661357[source]
There was an example on the OpenAI blog of ChatGPT copying the speaker's voice and responding in it mid-conversation. This was presented as an example of misalignment.
8. wordglyph ◴[] No.45662630[source]
I used AI Studio and it understood pitch and even emotion from an uploaded MP3.
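
For anyone who wants to try the same thing outside the AI Studio UI, here is a hedged sketch assuming the google-generativeai SDK; the model name, file name, and prompt are placeholders.

    # Ask a Gemini model about pitch/emotion in an uploaded audio file.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    audio = genai.upload_file("sample.mp3")          # placeholder file name
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content(
        [audio, "Is the speaker using a high-pitched or low-pitched voice, "
                "and what emotion do you hear?"])
    print(response.text)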
9. jasonjayr ◴[] No.45664506{4}[source]
> During this phase, 15 discovered the Pony Preservation Project, a collaborative project started by /mlp/, the My Little Pony board on 4chan.[47] Contributors of the project had manually trimmed, denoised, transcribed, and emotion-tagged thousands of voice lines from My Little Pony: Friendship Is Magic and had compiled them into a dataset that provided ideal training material for 15.ai.[48]

From https://en.wikipedia.org/wiki/15.ai#2016%E2%80%932020:_Conce...

replies(1): >>45667145 #
10. vvolhejn ◴[] No.45667145{5}[source]
I had no idea this existed; the internet is amazing.