
GPT-5.2 (openai.com)
1019 points by atgctg | 52 comments
1. zug_zug ◴[] No.46235131[source]
For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?
replies(15): >>46235139 #>>46235151 #>>46235193 #>>46235277 #>>46235779 #>>46236133 #>>46236236 #>>46236283 #>>46236341 #>>46236399 #>>46236665 #>>46236951 #>>46237061 #>>46237082 #>>46237617 #
2. FrasiertheLion ◴[] No.46235139[source]
Try elevenlabs
replies(1): >>46235296 #
3. bigyabai ◴[] No.46235151[source]
Qwen does.
replies(1): >>46235243 #
4. Robdel12 ◴[] No.46235193[source]
I have found Claude's voice chat to be better. I only recently tried it because I liked ChatGPT's enough, but I think I'm going to use Claude going forward. I find myself getting interrupted by ChatGPT a lot whenever I do use it.
replies(1): >>46235258 #
5. sosodev ◴[] No.46235243[source]
Qwen's voice chat is nowhere near as good as ChatGPT's.
6. lxgr ◴[] No.46235258[source]
Claude’s voice chat isn’t “native” though, is it? It feels like it’s speech-to-text-to-LLM and back.
replies(2): >>46235357 #>>46236680 #
7. codybontecou ◴[] No.46235277[source]
Their voice agent is handy. Currently trying to build around it.
8. sosodev ◴[] No.46235296[source]
Does elevenlabs have a real-time conversational voice model? It seems like their focus is largely on text to speech and speech to text. That can approximate this kind of thing, but it's not at all the same as the native voice-to-voice that 4o does.
replies(2): >>46235524 #>>46236377 #
9. sosodev ◴[] No.46235357{3}[source]
You can test it by asking it to: change the pitch of its voice, make specific sounds (like laughter), differentiate between words that are spelled the same but pronounced differently ("record" the noun vs. "record" the verb), etc.
replies(2): >>46235438 #>>46236201 #
10. lxgr ◴[] No.46235438{4}[source]
Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?
replies(3): >>46235639 #>>46235768 #>>46235803 #
11. dragonwriter ◴[] No.46235524{3}[source]
> Does elevenlabs have a real-time conversational voice model?

Yes.

> It seems like their focus is largely on text to speech and speech to text.

They have two broad offerings ("Platforms"); you seem to be looking at what they call the "Creative Platform". The real-time conversational piece is the centerpiece of the "Agents Platform".

replies(2): >>46235600 #>>46235719 #
12. sosodev ◴[] No.46235600{4}[source]
The architecture docs for the Agents Platform specifically say that it's STT (ASR) -> LLM -> TTS:

https://elevenlabs.io/docs/agents-platform/overview#architec...

13. barrkel ◴[] No.46235639{5}[source]
The model giving it text to speak would have to annotate the text in order for the TTS to add the right affect. The TTS wouldn't "remember" such instructions from an earlier speech-to-text stage.
14. ◴[] No.46235719{4}[source]
15. sosodev ◴[] No.46235768{5}[source]
Yes, a sufficiently advanced marriage of TTS and LLM could pass a lot of these tests. That kind of blurs the line between a native voice model and not, though.

You would need:

* An STT (ASR) model that outputs phonetics, not just words

* An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc.

* A TTS model that understands those tokens and properly generates the matching voice

At that point I would probably argue that you've created a native voice model, even if it's still less nuanced than the proper voice-to-voice of something like 4o. The latency would likely be quite high, though. I'm pretty sure I've seen a couple of open-source projects that have done this type of setup, but I've not tried testing them.
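
(A minimal sketch of what that cascade could look like, assuming purely hypothetical stage functions; stt, llm, tts, and Transcript are placeholder names for illustration, not any real library's API.)

    from dataclasses import dataclass

    # Minimal sketch of the cascaded pipeline described above (STT -> LLM -> TTS).
    # Every name here is a hypothetical placeholder, not a real library API.

    @dataclass
    class Transcript:
        text: str          # plain words, e.g. "that's a new record"
        phonemes: str      # phonetic rendering, so the LLM can tell homographs apart
        events: list[str]  # non-speech vocalizations, e.g. ["laughter"]

    def stt(audio: bytes) -> Transcript:
        """ASR stage that keeps phonetics and audio events, not just the words."""
        ...

    def llm(transcript: Transcript) -> str:
        """LLM stage, fine-tuned to read those annotations and to emit
        prosody-control tokens in its reply, e.g. "<pitch:high> Sure! <laugh>"."""
        ...

    def tts(annotated_text: str) -> bytes:
        """TTS stage that understands the prosody tokens and renders a matching voice."""
        ...

    def voice_turn(audio_in: bytes) -> bytes:
        # Three model hops per turn is where the extra latency comes from,
        # compared with a native speech-to-speech model like 4o's voice mode.
        return tts(llm(stt(audio_in)))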

16. websiteapi ◴[] No.46235779[source]
Gemini Live is a thing - never tried ChatGPT, are they not similar?
replies(2): >>46235984 #>>46236261 #
17. jablongo ◴[] No.46235803{5}[source]
I tried to make ChatGPT sing "Mary Had a Little Lamb" recently; it's atonal but vaguely resembles the melody, which is interesting.
18. jeanlucas ◴[] No.46235984[source]
no.
replies(2): >>46236073 #>>46236152 #
19. leaK_u ◴[] No.46236073{3}[source]
how.
replies(1): >>46236120 #
20. CamelCaseName ◴[] No.46236120{4}[source]
I find ChatGPT's voice to text to be the absolute best in the world, nearly perfect.

I have constant frustrations with Gemini voice to text misunderstanding what I'm saying or worse, immediately sending my voice note when I pause or breathe even though I'm midway through a sentence.

21. tmaly ◴[] No.46236133[source]
I can't keep up with half the new features all the model companies keep rolling out. I wish they would solve that.
22. nickvec ◴[] No.46236152{3}[source]
What? The voice chat is basically identical on ChatGPT and Gemini AFAICT.
23. ◴[] No.46236201{4}[source]
24. sundarurfriend ◴[] No.46236236[source]
Are you saying ChatGPT's voice chat is of good quality? Because for me it's one of its most frustrating weaknesses. I vastly prefer voice input to typing, and would love it if the voice chat mode actually worked well.

But apart from the voices being pretty meh, it's also really bad at detecting and filtering out noise, treating vehicle sounds as pauses where it can start talking (even if I'm talking much louder at the same time) or transcribing them as random YouTube subtitles (car motor = "Thanks for watching, subscribe!").

The speech-to-text is really unreliable (the single-chat Dictate feature gets about 98% of my words right, while this Voice mode is closer to 75%), and they clearly use an inferior model for the AI backend here too: ask the same question in this back-and-forth Voice mode and in a normal text chat, and the quality difference is stark; the Voice mode answer is most often close to useless. It seems like they've overoptimized it for speed at the cost of quality, to the point that it feels a year behind in answer reliability and usefulness.

To your question about competitors, I've recently noticed that Grok seems to be much better at both the speech-to-text part and the noise handling, and the voices sound less uncanny-valley too. It also doesn't have as stark a difference between text answers and voice-mode answers, but unfortunately that's mainly because its text answers aren't great either when it comes to hallucinations or following instructions.

So Grok has the voice part figured out, and ChatGPT has the backend AI reliability figured out, but neither provides a really usable voice mode right now.

25. spudlyo ◴[] No.46236261[source]
Not for my use case. I can open it up, and in restored classical Latin pronunciation say "Hi, my name is X, how are you?" and it will respond (also in Latin) "Hello X, I am well, thanks for asking. I hope you are doing great." Its pronunciation is not great, but intelligible. In the written transcript, it butchers what I say, but its responses look good, although sans macrons indicating phonemic vowel length.

Gemini responds in what I think is Spanish, or perhaps Portuguese.

However, I can hand Gemini-3-pro-preview an 8-minute, 48k mono mp3 of a nuanced Latin speaker who nasalizes his vowels and makes regular use of elision, and it will produce an accurate macronized Latin transcription. It's pretty mind-blowing.

replies(1): >>46236534 #
26. semiinfinitely ◴[] No.46236283[source]
try gemini voice chat
27. ivape ◴[] No.46236341[source]
I'm a big user of Gemini voice. My sense is that Gemini voice uses very tight system prompts that are designed to give you an answer and kind of get you off the phone as much as possible. It doesn't have large context at all.

That's how I judge quality at least. The quality of the actual voice is roughly the same as ChatGPT, but I notice Gemini will try to match your pitch and tone and way of speaking.

Edit: But it looks like Gemini Voice has been replaced with voice transcription in the mobile app? That was sudden.

28. hi_im_vijay ◴[] No.46236377{3}[source]
[Disclaimer: I work at ElevenLabs.] We specifically went with a cascading model for our Agents Platform because it's better suited to enterprise use cases where customers have full control over the brain and can bring their own LLM. With that said, even with a cascading model we can capture a decent amount of nuance with our ASR model, and it also supports capturing audio events like laughter or coughing.

A true speech-to-speech conversational model will perform better on things like capturing tone, pronunciations, phonetics, etc., but I do believe we'll also get better at that on the ASR side over time.

replies(1): >>46239343 #
29. whimsicalism ◴[] No.46236399[source]
Gemini does, Grok does, nobody else does (except Alibaba, but it's not there yet).
30. Dilettante_ ◴[] No.46236534{3}[source]
I have to ask: what use case requires you to speak Latin to the LLM?
replies(2): >>46236612 #>>46237315 #
31. spudlyo ◴[] No.46236612{4}[source]
I'm a Latin language learner, and part of developing fluency is practicing extemporaneous speech. My dog is a patient listener, but a poor interlocutor. There are Latin language Discord servers where you can speak to people, but I don't quite have the confidence to do that yet. I assume the machine doesn't judge my shitty grammar.
replies(1): >>46236839 #
32. joshmarlow ◴[] No.46236665[source]
I think Grok's voice chat is almost there - the only things missing for me:

* it's slower to start up by a couple of seconds

* it's harder to switch between voice and text and back again in the same chat (though ChatGPT isn't perfect at this either)

And of course Grok's unhinged persona is... something else.

replies(3): >>46236739 #>>46236910 #>>46237656 #
33. causalmodels ◴[] No.46236680{3}[source]
I just asked it and it said that it uses the on-device TTS capabilities.
replies(1): >>46237191 #
34. nazgulsenpai ◴[] No.46236739[source]
It's so much fun. So is the Conspiracy persona.
35. onraglanroad ◴[] No.46236839{5}[source]
Loquerisne Latine? ("Do you speak Latin?")

Non vere, sed intelligere possum. ("Not really, but I can understand it.")

Ita, mihi est canis qui idipsum facit! ("Yes, I have a dog who does the very same thing!")

(translated from the Gàidhlig)

replies(1): >>46237022 #
36. hbarka ◴[] No.46236951[source]
On the contrary, I thought Gemini 3 Live mode is much, much better than ChatGPT. The voices have none of the annoying artificial uptalking intonation that ChatGPT has, and the simplex/duplex interruptibility of Gemini Live seems more responsive. It knows when to break and pause during conversations.
replies(1): >>46238724 #
37. spudlyo ◴[] No.46237022{6}[source]
Certe loqui conor, sed saepenumero prave dico; canis meus non turbatus est ;) ("I certainly try to speak it, but I very often say it wrong; my dog is not bothered.")
38. simondotau ◴[] No.46237061[source]
I absolutely loathe ChatGPT's voice chat. It spends far too much time being conversational and its eagerness to please becomes fatiguing after the first back-and-forth.
39. josephwegner ◴[] No.46237082[source]
Along with the hordes of other options people are responding with, I'm a big fan of Perplexity's voice chat. It does back-and-forth well in a way that I missed whenever I tried anything besides ChatGPT.
replies(1): >>46237200 #
40. furyofantares ◴[] No.46237191{4}[source]
I find it very unlikely that it would be trained on that information or that Anthropic would put that in its context window, so it's very likely that it just made that answer up.
replies(1): >>46237402 #
41. solarkraft ◴[] No.46237200[source]
It is, shockingly, based on the OpenAI Realtime API.
42. nineteen999 ◴[] No.46237315{4}[source]
You haven't heard? Latin is the next big wave, after blockchain and AI.
replies(1): >>46237546 #
43. causalmodels ◴[] No.46237402{5}[source]
No, it did not make it up. I was curious, so I asked it to imitate a posh British accent imitating a South Brooklyn accent while having a head cold, and it explained that it didn't have fine-grained control over the audio output because it was using a TTS. I asked it how it knew that, and it pointed me towards [1] and highlighted the following.

> As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.

Tracked down the original source [2] and looked for additional updates but couldn't find anything.

[1] https://simonwillison.net/2025/May/31/using-voice-mode-on-cl...

[2] https://trust.anthropic.com/updates

replies(1): >>46237525 #
44. furyofantares ◴[] No.46237525{6}[source]
If it does a web search, that's fine; I assumed it hadn't since you hadn't linked to anything.

Also, it being right doesn't mean it didn't just make up the answer.

45. spudlyo ◴[] No.46237546{5}[source]
You laugh, but the global language-learning market in 2025 is expected to exceed USD 100 billion, and LLMs IMHO are poised to disrupt the shit out of it.
replies(1): >>46241967 #
46. SweetSoftPillow ◴[] No.46237617[source]
Gemini's much better, try it
47. Gigachad ◴[] No.46237656[source]
Pretty good until it goes crazy glazing Elon or declaring itself MechaHitler.
replies(1): >>46238626 #
48. hcurtiss ◴[] No.46238626{3}[source]
Neither of these has happened in my use. Those were both the product of some pretty aggressive prompting, and were remedied months ago.
replies(1): >>46241856 #
49. febed ◴[] No.46238724[source]
Apart from it sounding a bit stiff and informal, I was also surprised at how good Gemini Live mode is in regional Indian languages.
50. OrangeMusic ◴[] No.46241856{4}[source]
Yet, using this model in any way whatsoever after these episodes seems absolutely crazy to me.
replies(1): >>46242439 #
51. nineteen999 ◴[] No.46241967{6}[source]
Well sure, I can see that happening ... but I can't see Latin making a huge comeback, unfortunately.
52. user34283 ◴[] No.46242439{5}[source]
Grok is the only frontier model that is at all usable for adult content.