425 points karimf | 2 comments
miki123211 ◴[] No.45656279[source]
> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.

I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.

AFAIK, ChatGPT Voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and assuming ethnicity / biasing based on accents.

It doesn't seem that impossible to me that some of these behaviors have been aligned out of these models out of an abundance of caution.

replies(7): >>45656408 #>>45656467 #>>45656667 #>>45657021 #>>45657291 #>>45658995 #>>45665432 #
sbrother ◴[] No.45656408[source]
I don't think it's just safeguards; they really don't seem to understand pitch at all. I tried asking ChatGPT's advanced voice mode to recognize a tune I was humming, and it insisted it was Beethoven's 5th -- multiple times. I think it must have basically tokenized my humming to "dun dun dun duuun".
replies(1): >>45656700 #
bigzyg33k ◴[] No.45656700[source]
Advanced voice mode operates on audio tokens directly; it doesn't transcribe them into "text tokens" as an intermediate step like the original version of voice mode did.
replies(3): >>45656930 #>>45656981 #>>45656999 #
1. oezi ◴[] No.45656999{3}[source]
Absolutely correct! My simple test is whether it can tell the American and British English pronunciations of "tomato" and "potato" apart. So far it can't.
replies(1): >>45662122 #
2. fragmede ◴[] No.45662122[source]
Which "it" are you referring to? There are models that can.