    257 points by amrrs | 13 comments
    1. Mizza No.41842147
    What's SOTA for open source or on-device right now?

    I tried building a babelfish with o1, but the transcription in languages other than English is useless. When it does get the transcription right, the translations are near-perfect and the voice responses are super fast, but without good transcription it's kind of useless. So close!
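    A minimal sketch of the cascade I mean (transcribe, translate, speak) using the OpenAI Python SDK; the model names and the pipeline shape are my assumptions, not a known-good recipe:

        # Hypothetical transcribe -> translate -> speak cascade (OpenAI Python SDK v1).
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def babelfish(audio_path: str, target_lang: str = "English") -> bytes:
            # 1. Transcribe the source audio; this is the step that falls apart
            #    for languages other than English.
            with open(audio_path, "rb") as f:
                transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

            # 2. Translate the transcript text.
            translation = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": f"Translate the user's text to {target_lang}."},
                    {"role": "user", "content": transcript.text},
                ],
            ).choices[0].message.content

            # 3. Speak the translation.
            speech = client.audio.speech.create(model="tts-1", voice="alloy", input=translation)
            return speech.content  # audio bytes (MP3 by default)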

    replies(5): >>41842153 >>41842200 >>41842281 >>41843179 >>41846783
    2. amrrs No.41842153
    Have you tried Moshi? https://huggingface.co/collections/kyutai/moshi-v01-release-...
    3. refulgentis No.41842200
    I'm not sure exactly what you mean: this is TTS, but it sounds like you're expecting an answer about transcription.

    So it's hard to know both which category you'd like to hear about and, if you do mean transcription, what your baseline is.

    Whisper is widely regarded as the best in the free camp, though I wouldn't be surprised to see a paper or a much bigger model claiming a better WER.

    If you meant you tried realtime 4o from OpenAI, and not o1*, it uses Whisper for transcription server-side, so I don't think you'll see much gain from trying Whisper yourself. My next try would be the Google Cloud APIs, but they're paid, and with regard to your question about open source SOTA, the underlying model isn't open.

    But also, if you did mean 4o, the transcription shouldn't matter for output quality, since the model takes in voice directly (I verified their claim by noticing that when there are errors in the transcription, it still answers correctly).

    * I keep mixing these two up, and it seems unlikely you meant o1, because it has a long synchronous delay before any part of the answer is available and doesn't take in audio.

    If you did mean o1, then I'd use realtime 4o for TTS and have it do the translation natively, since it will be unaffected by the transcription errors you're facing now.
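    (If anyone wants to baseline against Whisper locally anyway, a minimal sketch with the openai-whisper package; the checkpoint size and language here are illustrative choices, not recommendations:)

        import whisper  # pip install openai-whisper

        # Larger checkpoints (medium/large) handle non-English audio noticeably better.
        model = whisper.load_model("medium")

        # task="transcribe" keeps the source language; task="translate" goes straight to English.
        result = model.transcribe("input.wav", language="es", task="transcribe")
        print(result["text"])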

    replies(1): >>41846326
    4. diggan No.41842281
    I was literally just looking at that today, and the best one I came across was F5-TTS: https://swivid.github.io/F5-TTS/

    The only thing missing (for me) is "emotion tokens", instead of forcing the entire generation to use a single emotion; the generated voice is a bit too robotic otherwise.

    replies(1): >>41842581
    5. moffkalast No.41842581
    > based on flow matching with Diffusion Transformer

    Yeah, that's not gonna be realtime. It's really odd that we currently have two options: VITS/Piper, which runs at ludicrous speed on a CPU and is kinda OK, and the slightly more natural ones a la StyleTTS2, which take 2 minutes to generate a sentence even with CUDA acceleration.

    Like, is there a middle ground? Maybe inverting one of the smaller Whispers or something.

    replies(2): >>41842748 >>41842974
    6. modeless No.41842748
    StyleTTS2 is faster than realtime
    replies(1): >>41847018
    7. gunalx No.41842974
    Bark?
    8. kabirgoel No.41843179
    I work at Cartesia, which operates a TTS API similar to Play [1]. I'd venture a guess that our TTS model, Sonic, is probably SOTA for on-device, but don't quote me on that. It's the same model that powers our API.

    Sonic can run on a MacBook Pro. Our API sounds better, of course, since it runs the model on GPUs without special tricks like quantization. But subjectively, the on-device version is good quality and real-time, and it has all the capabilities of the larger model, such as voice cloning.

    Our co-founders did a demo of the on-device capabilities on the No Priors podcast [2], if you're interested in checking it out for yourself. (I will caveat that this sounds quite a bit worse than if you heard it in person today, since this was an early alpha + it's a recording of the output from a MacBook Pro speaker.)

    [1] https://cartesia.ai/sonic
    [2] https://youtu.be/neQbqOhp8w0?si=2n1i432r5fDG2tPO&t=1886

    replies(1): >>41862255
    9. krageon No.41846326
    GP said local / on-device. Most of what you mentioned is cloud shit.
    replies(1): >>41850855
    10. jankovicsandras No.41846783
    Hi, I don't know what's SOTA, but I got good results with these (open source, on-device):

    https://github.com/SYSTRAN/faster-whisper (speech-to-text)
    https://github.com/rhasspy/piper (text-to-speech)
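    A minimal sketch of wiring those two together; the checkpoint names and the int8 setting are just what I'd reach for first, not the only options:

        # Speech-to-text with faster-whisper (pip install faster-whisper)
        import subprocess

        from faster_whisper import WhisperModel

        stt = WhisperModel("small", device="cpu", compute_type="int8")  # int8 keeps CPU inference fast
        segments, info = stt.transcribe("input.wav")
        text = "".join(segment.text for segment in segments)
        print(f"[{info.language}] {text}")

        # Text-to-speech by piping the text into piper's CLI (pip install piper-tts)
        subprocess.run(
            ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "reply.wav"],
            input=text.encode("utf-8"),
            check=True,
        )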

    11. moffkalast No.41847018
    To be clear, what I mean by realtime is a full generation in at most 200 ms, so the audio can be sent to the sound card and start playing immediately; not merely generating faster than the audio takes to play, which in practice would still add an unusably long delay.

    I suppose it might be possible by streaming very short segments, but I haven't seen any implementation that allows for that, and with diffusion-based models it doesn't even work conceptually.
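    Something like this is what I'd imagine for the segment-streaming approach; `synthesize` is a placeholder for whatever non-diffusion TTS you plug in, and the whole trick is that synthesis of segment n+1 overlaps playback of segment n:

        # Hypothetical latency-hiding sketch: synthesize the next short segment
        # while the current one is playing. `synthesize` must return
        # (samples, sample_rate) for a text segment.
        import re
        from concurrent.futures import ThreadPoolExecutor

        import sounddevice as sd  # pip install sounddevice

        def synthesize(segment: str):
            raise NotImplementedError("plug your TTS in here")

        def stream_speak(text: str):
            # Split into short clause-sized chunks so each generation stays small.
            segments = re.split(r"(?<=[.!?,])\s+", text)
            with ThreadPoolExecutor(max_workers=1) as pool:
                future = pool.submit(synthesize, segments[0])
                for nxt in segments[1:]:
                    samples, rate = future.result()
                    future = pool.submit(synthesize, nxt)  # overlaps with playback below
                    sd.play(samples, rate)
                    sd.wait()
                samples, rate = future.result()
                sd.play(samples, rate)
                sd.wait()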

    12. refulgentis No.41850855
    Yeah, I covered on-device. Okay, let's call the rest cloud shit. Like I said, it's a confusing comment: they said open source and on-device, then talked about quality issues with the cloud shit they're using, which certainly won't be resolved by switching to on-device models. shrug
    13. pietz No.41862255
    Is your model really open source or did you misunderstand the question?