For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?
replies(15):
You would need:
* A STT (ASR) model that outputs phonetics not just words
* An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc
* A TTS model that understands those tokens and properly generate the matching voice
At that point I would probably argue that you've created a native voice model even if it's still less nuanced than the proper voice to voice of something like 4o. The latency would likely be quite high though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup but I've not tried testing them.