Good with numbers mostly!
“I’ve successfully processed your order and I’d like to confirm your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo, 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray.”
Phone numbers and the like were read nicely, but apparently a string of alphanumerics for an order number isn't handled well yet.
I thought this was included in the demo, it seemed okay!
Also not the end of the world to process stuff like this with a regex.
Most of these newer TTS models require this type of formatting to reliably read out long strings of numbers and IDs.
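A minimal sketch of the regex preprocessing mentioned above (the `spell_out_id` helper and the "4+ characters, mixed letters and digits" matching rule are my own assumptions, not from any particular TTS library):

```python
import re

# NATO phonetic alphabet, used to disambiguate letters read over audio.
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray", "Y": "Yankee",
    "Z": "Zulu",
}

def spell_out_id(product_id: str) -> str:
    """Expand an alphanumeric ID into a comma-separated, NATO-spelled string."""
    parts = []
    for ch in product_id.upper():
        parts.append(f"{ch} as in {NATO[ch]}" if ch.isalpha() else ch)
    return ", ".join(parts)

def preprocess(text: str) -> str:
    """Rewrite runs of 4+ uppercase letters/digits (containing both) in place."""
    pattern = r"\b(?=[A-Z0-9]*\d)(?=[A-Z0-9]*[A-Z])[A-Z0-9]{4,}\b"
    return re.sub(pattern, lambda m: spell_out_id(m.group()), text)

print(preprocess("It is A123B567Z890X."))
# → It is A as in Alpha, 1, 2, 3, B as in Bravo, 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray.
```

The two lookaheads ensure a match contains at least one digit and one letter, so ordinary words and plain numbers pass through untouched.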
[1] https://github.com/milosgajdos/go-playht [2] https://github.com/milosgajdos/playht_rs
I tried building a babelfish with o1, but the transcription in languages other than English is useless. When it gets the transcription right, the translations are pretty much perfect and the voice responses are super fast, but without good transcription the whole thing falls apart. So close!
So it's hard to know both which category you'd like to hear about and, if you do mean transcription, what your baseline is.
Whisper is widely regarded as the best in the free camp, but I wouldn't be surprised to see a paper on a model claiming better WER, or a much bigger model.
If you meant you tried realtime 4o from OpenAI, and not o1*, it uses Whisper for transcription on the server, so I don't think you'll see much gain from trying Whisper yourself. My next try would be the Google Cloud APIs, but they're paid, and with regard to your question re: open-source SOTA, the underlying model isn't open.
But also, if you did mean 4o, transcription errors shouldn't matter for output quality: the model takes in the audio directly. (I verified their claim by noticing that even when there are errors in the transcription, it still answers correctly.)
* I keep mixing these two up when talking about them, and it seems unlikely you meant o1, because it has a long synchronous delay before any part of the answer is available and doesn't take in audio.
If you did mean o1, then I'd use realtime 4o for TTS and have it do the translation natively, since it will be unaffected by the transcription errors you're facing now.
Only thing missing (for me) is "emotion tokens" instead of forcing the entire generation to be with a specific emotion, as the generated voice is a bit too robotic otherwise.
The phone numbers were not read naturally at all. A human would read a grouping like 123-456-789 as "123", "456", "789", but the model instead generated something like "123", "45", "6789". Listen to the RSVP example again and you'll hear what I mean. The pacing is generally off for normal text too, but it's extra noticeable with the numbers.
My hunch would be that it's because of tokenization, but I wouldn't be able to say that's the issue for sure. Sounds like it though :)
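If the odd grouping really is a tokenization artifact, one workaround is to pre-chunk digit runs in the input text yourself, so the pauses are forced by punctuation rather than left to the model. A minimal sketch (the group size and helper name are my own assumptions):

```python
import re

def chunk_digits(number: str, group: int = 3) -> str:
    """Split a digit string into fixed-size groups so TTS pauses at each comma."""
    digits = re.sub(r"\D", "", number)  # strip separators like "-" or " "
    groups = [digits[i:i + group] for i in range(0, len(digits), group)]
    return ", ".join(groups)

print(chunk_digits("123-456-789"))  # → 123, 456, 789
```

Since each group is short and comma-terminated, the model can no longer merge digits across the boundaries a human would pause at.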
Do you have a link to your Obsidian TTS plugin?
Yeah, that's not gonna be realtime. It's really odd that we currently have two options: VITS/Piper, which runs at a ludicrous speed on a CPU and is kinda OK, and the slightly more natural models a la StyleTTS2 that take 2 minutes to generate a sentence even with CUDA acceleration.
Like, is there a middle ground? Maybe inverting one of the smaller whispers or something.
How does that end up in an announcement? Do people not notice, or not care? Or are they trying to show realistic mistakes?
Sonic can be run on a MacBook Pro. Our API sounds better, of course, since that's running the model on GPUs without any special tricks like quantization. But subjectively the on-device version is good quality and real-time, and it possesses all the capabilities of the larger model, such as voice cloning.
Our co-founders did a demo of the on-device capabilities on the No Priors podcast [2], if you're interested in checking it out for yourself. (I will caveat that this sounds quite a bit worse than if you heard it in person today, since this was an early alpha + it's a recording of the output from a MacBook Pro speaker.)
[1] https://cartesia.ai/sonic [2] https://youtu.be/neQbqOhp8w0?si=2n1i432r5fDG2tPO&t=1886
Very good voice cloning capability. Runs on an NVIDIA GPU with under 10 GB of VRAM.
I use OpenAI's voice models a lot and have access to all of them, and I'm honestly more impressed with the ease with which one can conduct a conversation with this voice model.
Honestly, this feels like the first voice model I would pilot as a customer service rep in a hospitality setting.
In general, TTS APIs seem to run with much higher margins than LLMs from what I know.
However, it cannot be used for the same use case because it's currently very slow; real-time usage is not yet possible with the current release code, in spite of the 0.15 RTF (real-time factor, i.e. 0.15 s of compute per second of audio) claimed in the paper.
Source: https://en.wikipedia.org/wiki/File:StatCounter-browser-ww-mo...
https://github.com/SYSTRAN/faster-whisper (speech-to-text) https://github.com/rhasspy/piper (text-to-speech)
I suppose it might be possible with streaming of very short segments, but I haven't seen any implementation that would allow for that, and with diffusion-based models it doesn't even work conceptually.
I believe verifying numbers up to at least 100000 should be a requirement for new TTS models.
No, the audio was OK, even good. The example seems to be an automated response from a system where a human has just placed an order. The order number is A123B567Z890X, but if we want our system to "read back" the order number, we apparently have to format the text specially. I suppose the clarifying "Alpha, Bravo" stuff is a good idea, but separating out every digit with all those commas?