GPT-5.2

(openai.com)
1019 points | atgctg | 1 comment
zug_zug No.46235131
For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?
Robdel12 No.46235193
I have found Claude's voice chat to be better. I only recently tried it because I liked ChatGPT's enough, but I think I'm going to use Claude going forward. I find myself getting interrupted by ChatGPT a lot whenever I do use it.
lxgr No.46235258
Claude’s voice chat isn’t “native” though, is it? It feels like it’s speech-to-text-to-LLM and back.
sosodev No.46235357
You can test it by asking it to: change the pitch of its voice, make specific sounds (like laughter), differentiate between heteronyms, i.e. words that are spelled the same but pronounced differently (like "record" the noun and "record" the verb), etc.
lxgr No.46235438
Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?
sosodev No.46235768
Yes, a sufficiently advanced marriage of TTS and LLM could pass a lot of these tests. That kind of blurs the line between a native voice model and a bolted-on one, though.

You would need:

* An STT (ASR) model that outputs phonetics, not just words

* An LLM fine-tuned to understand that input and also output the proper tokens for prosody control, non-speech vocalizations, etc.

* A TTS model that understands those tokens and properly generates the matching voice

At that point I would probably argue that you've created a native voice model, even if it's still less nuanced than the proper voice-to-voice approach of something like 4o. The latency would likely be quite high, though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup, but I've not tried testing them.
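The three-stage cascade above can be sketched in code. This is a hypothetical illustration only: every stage here is a stand-in stub (the names `asr`, `llm`, and `tts` are invented for this sketch, not any real project's API), and the point is just to show how phonetic detail and prosody tokens would have to survive each handoff for the homograph and pitch tests to pass.

```python
# Hypothetical cascaded voice pipeline: phoneme-aware ASR -> LLM that emits
# prosody control tokens inline -> TTS that renders those tokens.
# All three stages are stubs; a real system would run actual models.

from dataclasses import dataclass


@dataclass
class AsrResult:
    words: list[str]     # recognized words
    phonemes: list[str]  # phonetic transcription (e.g. IPA), so the LLM
                         # can tell "record" (verb) from "record" (noun)


def asr(audio: bytes) -> AsrResult:
    # Stand-in: pretend the user said "record the record".
    return AsrResult(
        words=["record", "the", "record"],
        phonemes=["rɪˈkɔːrd", "ðə", "ˈrɛkərd"],
    )


def llm(heard: AsrResult) -> str:
    # Stand-in: a fine-tuned LLM would consume the phonemes and emit
    # response text with inline control tokens such as <pitch:high>.
    return ("You pronounced the verb first: "
            "<pitch:high>record</pitch> versus record.")


def tts(tagged_text: str) -> bytes:
    # Stand-in: a TTS model trained on the same token vocabulary would
    # render <pitch:...> as prosody instead of speaking it literally.
    if "<pitch:" not in tagged_text:
        raise ValueError("prosody tokens were lost in the handoff")
    return b"synthesized-audio"


def voice_turn(audio_in: bytes) -> bytes:
    # One conversational turn through the full cascade.
    return tts(llm(asr(audio_in)))


audio_out = voice_turn(b"user-audio")
```

The fragility the commenters point at is visible even in the stub: each arrow in the cascade is a lossy text interface, so anything the shared token vocabulary can't express (and any stage's latency) is added on top of what a single end-to-end voice model would incur.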