Yes.
> It seems like their focus is largely on text to speech and speech to text.
They have two main broad offerings (“Platforms”); you seem to be looking at what they call the “Creative Platform”. The real-time conversational piece is the centerpiece of the “Agents Platform”.
https://elevenlabs.io/docs/agents-platform/overview#architec...
You would need:
* An STT (ASR) model that outputs phonetics, not just words
* An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc.
* A TTS model that understands those tokens and properly generates the matching voice
At that point I would probably argue that you've created a native voice model even if it's still less nuanced than the proper voice to voice of something like 4o. The latency would likely be quite high though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup but I've not tried testing them.
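To make the three pieces above concrete, here is a minimal sketch of how such a cascade could be wired together. Everything in it is hypothetical: the type names, the phoneme symbols, and the `<laugh>` / `<pitch:+2>` control tokens are made up for illustration, and the three functions are stubs standing in for real models, not any vendor's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhoneticTranscript:
    words: List[str]
    phonemes: List[str]  # assumed ARPAbet-like symbols; the real format would vary


@dataclass
class AnnotatedReply:
    text: str
    control_tokens: List[str]  # hypothetical prosody / vocalization tags


def transcribe(audio: bytes) -> PhoneticTranscript:
    """Stub for an STT model that emits phonetics as well as words."""
    return PhoneticTranscript(words=["hello"], phonemes=["HH", "AH", "L", "OW"])


def respond(transcript: PhoneticTranscript) -> AnnotatedReply:
    """Stub for an LLM fine-tuned to read phonetics and emit control tokens."""
    return AnnotatedReply(text="Hi there!", control_tokens=["<laugh>", "<pitch:+2>"])


def synthesize(reply: AnnotatedReply) -> bytes:
    """Stub for a TTS model that honours the control tokens.

    Here it only serializes tokens and text; a real model would return audio.
    """
    rendered = " ".join(reply.control_tokens + [reply.text])
    return rendered.encode("utf-8")


def voice_turn(audio: bytes) -> bytes:
    # The stages run strictly one after another, so their latencies add up.
    return synthesize(respond(transcribe(audio)))


if __name__ == "__main__":
    print(voice_turn(b"\x00\x01"))
```

The sequential call in `voice_turn` is where the latency concern comes from: nothing downstream can start until the stage before it has produced output, unless each stage streams.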
I have constant frustrations with Gemini voice to text misunderstanding what I'm saying or worse, immediately sending my voice note when I pause or breathe even though I'm midway through a sentence.
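The "sends when I pause or breathe" behaviour looks like an endpointing problem. As a rough sketch of the idea only (I have no idea what Gemini actually does), here is a naive end-of-turn detector that assumes per-frame energy values, a fixed energy threshold, and a silence timeout counted in frames. The function name and all parameter values are assumptions for illustration.

```python
from typing import List, Optional


def find_end_of_turn(
    frame_energies: List[float],
    energy_threshold: float = 0.1,
    hangover_frames: int = 5,
) -> Optional[int]:
    """Return the frame index at which the turn is declared over, or None.

    A turn ends once speech has been heard and then `hangover_frames`
    consecutive frames fall below `energy_threshold`.
    """
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0  # any loud frame resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None


if __name__ == "__main__":
    # Speech, a short breath pause, more speech, then real silence.
    energies = [0.5, 0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0]
    print(find_end_of_turn(energies, hangover_frames=2))  # cuts off at the breath
    print(find_end_of_turn(energies, hangover_frames=3))  # waits for the real end
```

With a short timeout the detector fires during the breath pause; a longer one waits for the real end of the sentence but makes every reply feel slower. That trade-off is presumably what these products are tuning.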
But apart from the voices being pretty meh, it's also really bad at detecting and filtering out noise, taking vehicle sounds as breaks to start talking in (even if I'm talking much louder at the same time) or as some random YouTube subtitles (car motor = "Thanks for watching, subscribe!").
The speech-to-text is really unreliable (the single-chat Dictate feature gets about 98% of my words correct, this Voice mode is closer to 75%), and they clearly use an inferior model for the AI backend for this too: with the same question asked in this back-and-forth Voice mode and a normal text chat, the answer quality difference is quite stark: the Voice mode answer is most often close to useless. It seems like they've overoptimized it for speed at the cost of quality, to the extent that it feels like it's a year behind in answer reliability and usefulness.
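The 98% and 75% figures above are eyeballed estimates. If you wanted to put a number on it, the usual metric is word error rate (WER). Here is a minimal sketch, assuming you have a reference transcript of what was actually said and that simple lowercasing plus whitespace splitting is good enough tokenization. Note that "words correct" is not quite `1 - WER`, since WER also counts inserted words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    said = "the cat sat on the mat"
    heard = "the cat sit on mat"
    print(f"WER: {word_error_rate(said, heard):.2f}")
```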
To your question about competitors, I've recently noticed that Grok seems to be much better at both the speech-to-text part and the noise handling, and the voices are less uncanny-valley sounding too. I'd say there also isn't as stark a difference between its text answers and its voice mode answers, which is true, but unfortunately mainly because its text answers are also not great when it comes to hallucinations or following instructions.
So Grok has the voice part figured out, ChatGPT has the backend AI reliability figured out, but neither provides a real, usable voice mode right now.
Gemini responds in what I think is Spanish, or perhaps Portuguese.
However, I can hand Gemini-3-pro-preview an 8-minute-long 48k mono mp3 of a nuanced Latin speaker who nasalizes his vowels and makes regular use of elision, and it will produce an accurate macronized Latin transcription. It's pretty mind-blowing.
That's how I judge quality at least. The quality of the actual voice is roughly the same as ChatGPT, but I notice Gemini will try to match your pitch and tone and way of speaking.
Edit: But it looks like Gemini Voice has been replaced with voice transcription in the mobile app? That was sudden.
a true speech to speech conversational model will perform better on things like capturing tone, pronunciation, phonetics, etc., but i do believe we'll also get better at that on the asr side over time.
And of course Grok's unhinged persona is... something else.
Non vere, sed intelligere possum. [Not really, but I can understand.]
Ita, mihi est canis qui idipsum facit! [Yes, I have a dog who does the very same thing!]
(translated from the Gàidhlig)
> As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.
Tracked down the original source [2] and looked for additional updates but couldn't find anything.
[1] https://simonwillison.net/2025/May/31/using-voice-mode-on-cl...
Also it being right doesn't mean it didn't just make up the answer.