GPT-5.2

(openai.com)
1019 points | atgctg | 1 comment
zug_zug No.46235131
For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?
Robdel12 No.46235193
I have found Claude's voice chat to be better. I only recently tried it because I liked ChatGPT's enough, but I think I'm going to use Claude going forward. I find myself getting interrupted by ChatGPT a lot whenever I do use it.
lxgr No.46235258
Claude’s voice chat isn’t “native” though, is it? It feels like it’s speech-to-text-to-LLM and back.
sosodev No.46235357
You can test it by asking it to: change the pitch of its voice, make specific sounds (like laughter), differentiate between heteronyms, i.e. words that are spelled the same but pronounced differently (like "record" the noun and "record" the verb), etc.
lxgr No.46235438
Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?
sosodev No.46235768
Yes, a sufficiently advanced marriage of TTS and LLM could pass a lot of these tests. That kind of blurs the line between a native voice model and a bolted-on one, though.

You would need:

* An STT (ASR) model that outputs phonetics, not just words

* An LLM fine-tuned to understand that input and also output the proper tokens for prosody control, non-speech vocalizations, etc.

* A TTS model that understands those tokens and properly generates the matching voice

At that point I would probably argue that you've created a native voice model, even if it's still less nuanced than the proper voice-to-voice approach of something like 4o. The latency would likely be quite high, though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup, but I've not tried testing them.
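The three-stage cascade above can be sketched in code. This is a hypothetical illustration only: every stage here is a stand-in stub (the names `asr`, `llm`, and `tts` are invented for this sketch, not any real project's API), and the point is just to show how phonetic detail and prosody tokens would have to survive each handoff for the homograph and pitch tests to pass.

```python
# Hypothetical cascaded voice pipeline: phoneme-aware ASR -> LLM that emits
# prosody control tokens inline -> TTS that renders those tokens.
# All three stages are stubs; a real system would run actual models.

from dataclasses import dataclass


@dataclass
class AsrResult:
    words: list[str]     # recognized words
    phonemes: list[str]  # phonetic transcription (e.g. IPA), so the LLM
                         # can tell "record" (verb) from "record" (noun)


def asr(audio: bytes) -> AsrResult:
    # Stand-in: pretend the user said "record the record".
    return AsrResult(
        words=["record", "the", "record"],
        phonemes=["rɪˈkɔːrd", "ðə", "ˈrɛkərd"],
    )


def llm(heard: AsrResult) -> str:
    # Stand-in: a fine-tuned LLM would consume the phonemes and emit
    # response text with inline control tokens such as <pitch:high>.
    return ("You pronounced the verb first: "
            "<pitch:high>record</pitch> versus record.")


def tts(tagged_text: str) -> bytes:
    # Stand-in: a TTS model trained on the same token vocabulary would
    # render <pitch:...> as prosody instead of speaking it literally.
    if "<pitch:" not in tagged_text:
        raise ValueError("prosody tokens were lost in the handoff")
    return b"synthesized-audio"


def voice_turn(audio_in: bytes) -> bytes:
    # One conversational turn through the full cascade.
    return tts(llm(asr(audio_in)))


audio_out = voice_turn(b"user-audio")
```

The fragility the commenters point at is visible even in the stub: each arrow in the cascade is a lossy text interface, so anything the shared token vocabulary can't express (and any stage's latency) is added on top of what a single end-to-end voice model would incur.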