You would need:
* An STT (ASR) model that outputs phonetics, not just words
* An LLM fine-tuned to understand that input and to output the proper tokens for prosody control, non-speech vocalizations, etc.
* A TTS model that understands those tokens and properly generates the matching voice
At that point I would probably argue that you've created a native voice model, even if it's still less nuanced than the proper voice-to-voice of something like 4o. The latency would likely be quite high though, since every turn has to pass through three models in sequence. I'm pretty sure I've seen a couple of open-source projects that have done this type of setup, but I haven't tested them.
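For what it's worth, here is a rough sketch of how those three stages could be wired together. Everything in it is made up for illustration: the `PhoneticTranscript` and `MarkedUpReply` types, the `<laugh>` and `<pitch:+2>` control tokens, and the three `fake_*` stages are placeholders, not the API of any real STT, LLM, or TTS library. It only shows the data flow, not real models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhoneticTranscript:
    """What the STT stage hands over: words plus phonetic detail (hypothetical shape)."""
    words: str
    phonemes: List[str]  # ARPAbet-style symbols, assumed here for illustration


@dataclass
class MarkedUpReply:
    """What the LLM stage hands over: text plus control tokens (hypothetical shape)."""
    text: str
    control_tokens: List[str]  # invented tags for prosody / non-speech sounds


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], PhoneticTranscript],
    llm: Callable[[PhoneticTranscript], MarkedUpReply],
    tts: Callable[[MarkedUpReply], bytes],
) -> bytes:
    """Run one conversational turn through the three stages in sequence."""
    transcript = stt(audio)   # audio -> words + phonetics
    reply = llm(transcript)   # phonetic transcript -> text + control tokens
    return tts(reply)         # text + control tokens -> audio


# Stand-in stages so the sketch runs; real models would replace these.
def fake_stt(audio: bytes) -> PhoneticTranscript:
    return PhoneticTranscript(words="hello", phonemes=["HH", "AH", "L", "OW"])


def fake_llm(transcript: PhoneticTranscript) -> MarkedUpReply:
    return MarkedUpReply(
        text=f"you said {transcript.words}",
        control_tokens=["<laugh>", "<pitch:+2>"],
    )


def fake_tts(reply: MarkedUpReply) -> bytes:
    # A real TTS would render audio; this just serializes what it was given.
    return (" ".join(reply.control_tokens) + " " + reply.text).encode("utf-8")


if __name__ == "__main__":
    print(run_turn(b"", fake_stt, fake_llm, fake_tts))
```

Because the stages run strictly one after another, the per-turn latency is roughly the sum of all three, which is where the latency concern comes from.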
> As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.
I tracked down the original source [2] and looked for additional updates, but couldn't find anything.
[1] https://simonwillison.net/2025/May/31/using-voice-mode-on-cl...
Also, the fact that it's right doesn't mean it didn't just make up the answer.