GPT-5.2 | slacker news

1. zug_zug ◴[11 Dec 25 18:34 UTC] No.46235131[source]▶

For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?

replies(15): >>46235139 #>>46235151 #>>46235193 #>>46235277 #>>46235779 #>>46236133 #>>46236236 #>>46236283 #>>46236341 #>>46236399 #>>46236665 #>>46236951 #>>46237061 #>>46237082 #>>46237617 #

2. FrasiertheLion ◴[11 Dec 25 18:34 UTC] No.46235139[source]▶

>>46235131 (TP) #

Try elevenlabs

replies(1): >>46235296 #

3. bigyabai ◴[11 Dec 25 18:35 UTC] No.46235151[source]▶

>>46235131 (TP) #

Qwen does.

replies(1): >>46235243 #

4. Robdel12 ◴[11 Dec 25 18:38 UTC] No.46235193[source]▶

>>46235131 (TP) #

I have found Claude’s voice chat to be better. I only recently tried it because I liked ChatGPT’s enough, but I think I’m going to use Claude going forward. I find myself getting interrupted by ChatGPT a lot whenever I do use it.

replies(1): >>46235258 #

5. sosodev ◴[11 Dec 25 18:42 UTC] No.46235243[source]▶

>>46235151 #

Qwen's voice chat is nowhere near as good as ChatGPT's.

6. lxgr ◴[11 Dec 25 18:43 UTC] No.46235258[source]▶

>>46235193 #

Claude’s voice chat isn’t “native” though, is it? It feels like it’s speech-to-text-to-LLM and back.

replies(2): >>46235357 #>>46236680 #

7. codybontecou ◴[11 Dec 25 18:44 UTC] No.46235277[source]▶

>>46235131 (TP) #

Their voice agent is handy. Currently trying to build around it.

8. sosodev ◴[11 Dec 25 18:45 UTC] No.46235296[source]▶

>>46235139 #

Does elevenlabs have a real-time conversational voice model? It seems like their focus is largely on text to speech and speech to text, which can approximate that type of thing but is not at all the same as the native voice to voice that 4o does.

replies(2): >>46235524 #>>46236377 #

9. sosodev ◴[11 Dec 25 18:48 UTC] No.46235357{3}[source]▶

>>46235258 #

You can test it by asking it to: change the pitch of its voice, make specific sounds (like laughter), differentiate between words that are spelled the same but pronounced differently (record and record), etc.

replies(2): >>46235438 #>>46236201 #

10. lxgr ◴[11 Dec 25 18:53 UTC] No.46235438{4}[source]▶

>>46235357 #

Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?

replies(3): >>46235639 #>>46235768 #>>46235803 #

11. dragonwriter ◴[11 Dec 25 18:58 UTC] No.46235524{3}[source]▶

>>46235296 #

> Does elevenlabs have a real-time conversational voice model?

Yes.

> It seems like their focus is largely on text to speech and speech to text.

They have two main broad offerings (“Platforms”); you seem to be looking at what they call the “Creative Platform”. The real-time conversational piece is the centerpiece of the “Agents Platform”.

replies(2): >>46235600 #>>46235719 #

12. sosodev ◴[11 Dec 25 19:02 UTC] No.46235600{4}[source]▶

>>46235524 #

It specifically says in the architecture docs for the agents platform that it's STT (ASR) -> LLM -> TTS

https://elevenlabs.io/docs/agents-platform/overview#architec...
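
To make the point concrete, here is a minimal sketch of a cascaded STT -> LLM -> TTS loop. It is not ElevenLabs code: `AudioClip`, `stt`, `llm`, `tts` and `cascade` are all hypothetical stand-ins, and the only assumption is that plain text is the one thing passed between stages. Under that assumption, anything about how something was said is gone before the LLM sees it.

```python
from dataclasses import dataclass


@dataclass
class AudioClip:
    """Hypothetical stand-in for recorded audio."""
    words: str       # what was said
    pitch_hz: float  # how it was said


def stt(clip: AudioClip) -> str:
    # A text-only ASR stage keeps the words and drops the acoustic detail.
    return clip.words


def llm(prompt: str) -> str:
    # Placeholder for any text-in, text-out model.
    return f"You said: {prompt}"


def tts(text: str) -> AudioClip:
    # The TTS stage never saw the input audio, so it uses a default pitch.
    return AudioClip(words=text, pitch_hz=120.0)


def cascade(clip: AudioClip) -> AudioClip:
    # Only a string crosses each stage boundary.
    return tts(llm(stt(clip)))


low = cascade(AudioClip("hello", pitch_hz=90.0))
high = cascade(AudioClip("hello", pitch_hz=300.0))
print(low == high)  # True: the pitch of the input never reaches the output
```

A native voice-to-voice model would, in principle, not have this bottleneck, which is why the pitch and laughter tests upthread are a reasonable probe.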

13. barrkel ◴[11 Dec 25 19:04 UTC] No.46235639{5}[source]▶

>>46235438 #

The model giving it text to speak would have to annotate the text in order for the TTS to add the affect. The TTS wouldn't "remember" such instructions from a speech to text stage previously.

14. ◴[11 Dec 25 19:09 UTC] No.46235719{4}[source]▶

>>46235524 #

15. sosodev ◴[11 Dec 25 19:13 UTC] No.46235768{5}[source]▶

>>46235438 #

Yes, a sufficiently advanced marrying of TTS and LLM could pass a lot of these tests. That kind of blurs the line between native voice model and not though.

You would need:

* An STT (ASR) model that outputs phonetics, not just words

* An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc

* A TTS model that understands those tokens and properly generates the matching voice

At that point I would probably argue that you've created a native voice model even if it's still less nuanced than the proper voice to voice of something like 4o. The latency would likely be quite high though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup but I've not tried testing them.
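
A rough sketch of the three pieces listed above, under loud assumptions: the tag names (`<laugh/>`, `<pitch value='...'>`), the capital-letter stress notation for phonemes, and every function here are made up for illustration and do not correspond to any real model or API. The `llm` stand-in is a hard-coded rule, not a fine-tuned model.

```python
import re
from dataclasses import dataclass


@dataclass
class Transcript:
    words: str
    phonemes: str  # hypothetical phonetic rendering, capitals mark stress


def stt(words: str, phonemes: str) -> Transcript:
    # Stand-in for an ASR model that emits phonetics as well as words.
    return Transcript(words, phonemes)


def llm(t: Transcript) -> str:
    # Stand-in for the fine-tuned LLM: it uses the stress pattern to tell
    # the noun "REK-ord" from the verb "ri-KORD", then emits control tags.
    kind = "noun" if t.phonemes[:1].isupper() else "verb"
    return f"<laugh/> You used the {kind} <pitch value='high'>{t.words}</pitch>."


TAG = re.compile(r"<laugh/>|<pitch value='(\w+)'>|</pitch>")


def tts(marked_up: str) -> list[tuple[str, str]]:
    # Stand-in for a TTS model that understands the tags. It returns
    # (instruction, payload) pairs instead of real audio.
    out, pitch, pos = [], "normal", 0
    for m in TAG.finditer(marked_up):
        text = marked_up[pos:m.start()].strip()
        if text:
            out.append((f"speak@{pitch}", text))
        if m.group(0) == "<laugh/>":
            out.append(("sound", "laugh"))
        elif m.group(0) == "</pitch>":
            pitch = "normal"
        else:
            pitch = m.group(1)
        pos = m.end()
    tail = marked_up[pos:].strip()
    if tail:
        out.append((f"speak@{pitch}", tail))
    return out


noun = tts(llm(stt("record", "REK-ord")))
verb = tts(llm(stt("record", "ri-KORD")))
print(noun)
print(verb)
```

The two calls pass the same word with different stress, and the replies differ, which a words-only transcript could not support. Each extra stage still adds latency, as noted above.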

16. websiteapi ◴[11 Dec 25 19:13 UTC] No.46235779[source]▶

>>46235131 (TP) #

gemini live is a thing - never tried chatgpt, are they not similar?

replies(2): >>46235984 #>>46236261 #

17. jablongo ◴[11 Dec 25 19:15 UTC] No.46235803{5}[source]▶

>>46235438 #

I tried to make ChatGPT sing Mary had a little lamb recently and it's atonal but vaguely resembles the melody, which is interesting.

18. jeanlucas ◴[11 Dec 25 19:26 UTC] No.46235984[source]▶

>>46235779 #

no.

replies(2): >>46236073 #>>46236152 #

19. leaK_u ◴[11 Dec 25 19:34 UTC] No.46236073{3}[source]▶

>>46235984 #

how.

replies(1): >>46236120 #

20. CamelCaseName ◴[11 Dec 25 19:38 UTC] No.46236120{4}[source]▶

>>46236073 #

I find ChatGPT's voice to text to be the absolute best in the world, nearly perfect.

I have constant frustrations with Gemini voice to text misunderstanding what I'm saying or worse, immediately sending my voice note when I pause or breathe even though I'm midway through a sentence.

21. tmaly ◴[11 Dec 25 19:39 UTC] No.46236133[source]▶

>>46235131 (TP) #

I can't keep up with half the new features all the model companies keep rolling out. I wish they would solve that

22. nickvec ◴[11 Dec 25 19:41 UTC] No.46236152{3}[source]▶

>>46235984 #

What? The voice chat is basically identical on ChatGPT and Gemini AFAICT.

23. ◴[11 Dec 25 19:46 UTC] No.46236201{4}[source]▶

>>46235357 #

24. sundarurfriend ◴[11 Dec 25 19:49 UTC] No.46236236[source]▶

>>46235131 (TP) #

Are you saying ChatGPT's voice chat is of good quality? Because for me it's one of its most frustrating weaknesses. I vastly prefer voice input to typing, and would love it if the voice chat mode actually worked well.

But apart from the voices being pretty meh, it's also really bad at detecting and filtering out noise, taking vehicle sounds as breaks to start talking in (even if I'm talking much louder at the same time) or as some random YouTube subtitles (car motor = "Thanks for watching, subscribe!").

The speech-to-text is really unreliable (the single-chat Dictate feature gets about 98% of my words correct, this Voice mode is closer to 75%), and they clearly use an inferior model for the AI backend for this too: with the same question asked in this back-and-forth Voice mode and a normal text chat, the answer quality difference is quite stark: the Voice mode answer is most often close to useless. It seems like they've overoptimized it for speed at the cost of quality, to the extent that it feels like it's a year behind in answer reliability and usefulness.

To your question about competitors, I've recently noticed that Grok seems to be much better at both the speech-to-text part and the noise handling, and the voices are less uncanny-valley sounding too. It also doesn't have that stark a difference between text answers and voice mode answers, but unfortunately that's mainly because its text answers are also not great when it comes to hallucinations or following instructions.

So Grok has the voice part figured out, ChatGPT has the backend AI reliability figured out, but neither provides a really usable voice mode right now.

25. spudlyo ◴[11 Dec 25 19:51 UTC] No.46236261[source]▶

>>46235779 #

Not for my use case. I can open it up, and in restored classical Latin pronunciation say "Hi, my name is X, how are you?" and it will respond (also in Latin) "Hello X, I am well, thanks for asking. I hope you are doing great." Its pronunciation is not great, but intelligible. In the written transcript, it butchers what I say, but its responses look good, although sans macrons indicating phonemic vowel length.

Gemini responds in what I think is Spanish, or perhaps Portuguese.

However, I can hand Gemini-3-pro-preview an 8-minute-long 48k mono mp3 of a nuanced Latin speaker who nasalizes his vowels and makes regular use of elision, and it will produce an accurate macronized Latin transcription. It's pretty mind-blowing.

replies(1): >>46236534 #

26. semiinfinitely ◴[11 Dec 25 19:53 UTC] No.46236283[source]▶

>>46235131 (TP) #

try gemini voice chat

27. ivape ◴[11 Dec 25 19:59 UTC] No.46236341[source]▶

>>46235131 (TP) #

I'm a big user of Gemini voice. My sense is that Gemini voice uses very tight system prompts that are designed to give you an answer and kind of get you off the phone as much as possible. It doesn't have large context at all.

That's how I judge quality at least. The quality of the actual voice is roughly the same as ChatGPT, but I notice Gemini will try to match your pitch and tone and way of speaking.

Edit: But it looks like Gemini Voice has been replaced with voice transcription in the mobile app? That was sudden.

28. hi_im_vijay ◴[11 Dec 25 20:02 UTC] No.46236377{3}[source]▶

>>46235296 #

[disclaimer, i work at elevenlabs] we specifically went with a cascading model for our agents platform because it's better suited for enterprise use cases where they have full control over the brain and can bring their own llm. with that said, even with a cascading model, we can capture a decent amount of nuance with our asr model, and it also supports capturing audio events like laughter or coughing.

a true speech to speech conversational model will perform better on things like capturing tone, pronunciations, phonetics, etc, but i do believe we'll also get better at that on the asr side over time.

replies(1): >>46239343 #

29. whimsicalism ◴[11 Dec 25 20:03 UTC] No.46236399[source]▶

>>46235131 (TP) #

gemini does, grok does, nobody else does (except alibaba but it’s not there yet)

30. Dilettante_ ◴[11 Dec 25 20:14 UTC] No.46236534{3}[source]▶

>>46236261 #

I have to ask: What usecase requires you to speak Latin to the llm?

replies(2): >>46236612 #>>46237315 #

31. spudlyo ◴[11 Dec 25 20:20 UTC] No.46236612{4}[source]▶

>>46236534 #

I'm a Latin language learner, and part of developing fluency is practicing extemporaneous speech. My dog is a patient listener, but a poor interlocutor. There are Latin language Discord servers where you can speak to people, but I don't quite have the confidence to do that yet. I assume the machine doesn't judge my shitty grammar.

replies(1): >>46236839 #

32. joshmarlow ◴[11 Dec 25 20:25 UTC] No.46236665[source]▶

>>46235131 (TP) #

I think Grok's voice chat is almost there - only things missing for me:

* it's slower to start-up by a couple of seconds

* it's harder to switch between voice and text and back again in the same chat (though ChatGPT isn't perfect at this either)

And of course Grok's unhinged persona is... something else.

replies(3): >>46236739 #>>46236910 #>>46237656 #

33. causalmodels ◴[11 Dec 25 20:27 UTC] No.46236680{3}[source]▶

>>46235258 #

I just asked it and it said that it uses the on device TTS capabilities.

replies(1): >>46237191 #

34. nazgulsenpai ◴[11 Dec 25 20:32 UTC] No.46236739[source]▶

>>46236665 #

It's so much fun. So is the Conspiracy persona.

35. onraglanroad ◴[11 Dec 25 20:40 UTC] No.46236839{5}[source]▶

>>46236612 #

Loquerisne Latine?

Non vere, sed intelligere possum.

Ita, mihi est canis qui idipsum facit!

(translated from the Gàidhlig)

replies(1): >>46237022 #

36. hbarka ◴[11 Dec 25 20:49 UTC] No.46236951[source]▶

>>46235131 (TP) #

On the contrary, I thought Gemini 3 Live mode is much much better than ChatGPT. The voices have none of the annoying artificial uptalking intonations that ChatGPT has, and the simplex/duplex interruptibility of Gemini Live seems more responsive. It knows when to break and pause during conversations.

replies(1): >>46238724 #

37. spudlyo ◴[11 Dec 25 20:55 UTC] No.46237022{6}[source]▶

>>46236839 #

Certe loqui conor, sed saepenumero prave dico; canis meus non turbatus est ;)

38. simondotau ◴[11 Dec 25 20:58 UTC] No.46237061[source]▶

>>46235131 (TP) #

I absolutely loathe ChatGPT's voice chat. It spends far too much time being conversational and its eagerness to please becomes fatiguing after the first back-and-forth.

39. josephwegner ◴[11 Dec 25 21:00 UTC] No.46237082[source]▶

>>46235131 (TP) #

Along with the hordes of other options people are responding with, I'm a big fan of Perplexity's voice chat. It does back-and-forth well in a way that I missed whenever I tried anything besides ChatGPT.

replies(1): >>46237200 #

40. furyofantares ◴[11 Dec 25 21:08 UTC] No.46237191{4}[source]▶

>>46236680 #

I find it very unlikely that it would be trained on that information or that Anthropic would put that in its context window, so it's very likely that it just made that answer up.

replies(1): >>46237402 #

41. solarkraft ◴[11 Dec 25 21:09 UTC] No.46237200[source]▶

>>46237082 #

It is, shockingly, based on the OpenAI Realtime Assistant API.

42. nineteen999 ◴[11 Dec 25 21:20 UTC] No.46237315{4}[source]▶

>>46236534 #

You haven't heard? Latin is the next big wave, after blockchain and AI.

replies(1): >>46237546 #

43. causalmodels ◴[11 Dec 25 21:28 UTC] No.46237402{5}[source]▶

>>46237191 #

No, it did not make it up. I was curious, so I asked it to imitate a posh British accent imitating a South Brooklyn accent while having a head cold, and it explained that it didn't have fine-grained control over the audio output because it was using a TTS. I asked it how it knew that and it pointed me towards [1] and highlighted the following.

> As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.

Tracked down the original source [2] and looked for additional updates but couldn't find anything.

[1] https://simonwillison.net/2025/May/31/using-voice-mode-on-cl...

[2] https://trust.anthropic.com/updates

replies(1): >>46237525 #

44. furyofantares ◴[11 Dec 25 21:37 UTC] No.46237525{6}[source]▶

>>46237402 #

If it does a web search that's fine, I assumed it hadn't since you hadn't linked to anything.

Also it being right doesn't mean it didn't just make up the answer.

45. spudlyo ◴[11 Dec 25 21:39 UTC] No.46237546{5}[source]▶

>>46237315 #

You laugh, but the global language learning market in 2025 is expected to exceed USD $100 billion, and LLMs IMHO are poised to disrupt the shit out of it.

replies(1): >>46241967 #

46. SweetSoftPillow ◴[11 Dec 25 21:44 UTC] No.46237617[source]▶

>>46235131 (TP) #

Gemini's much better, try it

47. Gigachad ◴[11 Dec 25 21:47 UTC] No.46237656[source]▶

>>46236665 #

Pretty good until it goes crazy glazing Elon or declaring itself mecha hitler.

replies(1): >>46238626 #

48. hcurtiss ◴[11 Dec 25 23:09 UTC] No.46238626{3}[source]▶

>>46237656 #

Neither of these has happened in my use. Those were both the product of some pretty aggressive prompting, and were remedied months ago.

replies(1): >>46241856 #

49. febed ◴[11 Dec 25 23:18 UTC] No.46238724[source]▶

>>46236951 #

Apart from sounding a bit stiff and informal, I was also surprised at how good Gemini Live mode is in regional Indian languages.

50. OrangeMusic ◴[12 Dec 25 07:53 UTC] No.46241856{4}[source]▶

>>46238626 #

Yet, using this model in any way whatsoever after these episodes seems absolutely crazy to me.

replies(1): >>46242439 #

51. nineteen999 ◴[12 Dec 25 08:14 UTC] No.46241967{6}[source]▶

>>46237546 #

Well sure, I can see that happening ... but I can't see Latin making a huge comeback, unfortunately.

52. user34283 ◴[12 Dec 25 09:36 UTC] No.46242439{5}[source]▶

>>46241856 #

Grok is the only frontier model that is at all usable for adult content.