VibeVoice: A Frontier Open-Source Text-to-Speech Model

(microsoft.github.io)

448 points lastdong | 2 comments | 03 Sep 25 10:44 UTC


simiones ◴[03 Sep 25 12:23 UTC] No.45114884[source]▶

I read the comments praising these voices as very lifelike, and went to the page primed to hear very convincing voices. That is not at all what I heard, though.

The voices are decent, but the intonation is off on almost every phrase, and there is a very clear robotic-sounding modulation. It's generally very impressive compared to many text-to-speech solutions from a few years ago, but for today, I find it very uninspiring. The AI generated voice you hear all over YouTube shorts is at least as good as most of the samples on this page.

The only part that seemed impressive to me was the English + (Mandarin?) Chinese sample; that one seemed to switch very seamlessly between the two. But this may well be simply because (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that, and (2) the different character systems make it extremely clear where the model needs to switch between languages. Peut-être que cela n'aurait pas été si simple if it had been switching between two languages using the same writing system - I'm particularly curious how it would have read "simple" in the phrase above (I think it should be read with the French pronunciation, for example).
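To make point (2) concrete, here is a rough sketch of why a script change is an easy signal and a same-script switch is not. It is only an illustration built on Python's standard `unicodedata` module; the `script_of` and `split_by_script` helpers are hypothetical names, and nothing here reflects how VibeVoice actually handles language switching.

```python
import unicodedata

def script_of(ch):
    """Coarse script label for one character (illustrative heuristic only)."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation carry no script of their own
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "han"
    if name.startswith("LATIN"):
        return "latin"
    return "other"

def split_by_script(text):
    """Split text into (script, run) pairs wherever the script changes."""
    runs = []  # each entry is [script, accumulated_text]
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch          # neutral or same-script char extends the run
        elif script is None:
            runs.append([None, ch])    # text starts with a neutral char
        elif runs and runs[-1][0] is None:
            runs[-1][0] = script       # leading neutral run adopts the first real script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])  # script changed: start a new run
    return [(script, run) for script, run in runs]

mixed_scripts = split_by_script("Hello 你好 world")
same_script = split_by_script("Peut-être que cela if it had been")
```

With English + Chinese the text falls apart into three runs at the script boundaries, while the French + English phrase comes back as a single Latin run, so a model gets no such hint and has to infer the language from the words themselves.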

And, of course, the singing part is painfully bad, I am very curious why they even included it.

replies(11): >>45114973 #>>45115076 #>>45115109 #>>45115714 #>>45115907 #>>45116238 #>>45116262 #>>45116513 #>>45117982 #>>45119535 #>>45122185 #

jstummbillig ◴[03 Sep 25 14:28 UTC] No.45116238[source]▶

>>45114884 #

Is there any better model you can point at? I would be interested in having a listen.

There are people – and it does not matter what it's about – that will overstate the progress made (and others will understate it, case in point). Neither should put a damper on progress. This is the best I personally have heard so far, but I certainly might have missed something.

replies(7): >>45116434 #>>45116663 #>>45117083 #>>45117999 #>>45120502 #>>45122997 #>>45124570 #

Uehreka ◴[03 Sep 25 14:45 UTC] No.45116434[source]▶

>>45116238 #

It’s tough to name the best local TTS since they all seem to trade off on quality and features and none of them are as good as ElevenLabs’ closed-source offering.

However Kokoro-82M is an absolute triumph in the small model space. It curbstomps models 10-20x its size in terms of quality while also being runnable on like, a Raspberry Pi. It’s the kind of thing I’m surprised even exists. Its downside is that it isn’t super expressive, but the af_heart voice is extremely clean, and Kokoro is way more reliable than other TTS models: It doesn’t have the common failure mode where you occasionally have a couple extra syllables thrown in because you picked a bad seed.

If you want something that can do convincing voice acting, either pay for ElevenLabs or keep waiting. If you’re trying to build a local AI assistant, Kokoro is perfect, just use that and check the space again in like 6 months to see if something’s beaten it. https://huggingface.co/hexgrad/Kokoro-82M

replies(2): >>45117058 #>>45117949 #

1. sandreas ◴[03 Sep 25 16:48 UTC] No.45117949[source]▶

>>45116434 #

What is your opinion about F5-TTS or Fish-TTS?

replies(1): >>45123530 #

2. brettpro ◴[04 Sep 25 04:17 UTC] No.45123530[source]▶

>>45117949 (TP) #

I recently implemented Fish for a project and found it adequate for TTS but wildly impressive in voice cloning. My POC originally required 3-10 audio samples but I removed the minimum because it could usually one shot it.

The model is good, but I will say their inference code leaves a lot to be desired. I had to rewrite large portions of it for simple things like correct chunking and streaming. The advertised expressive keywords are very much hit and miss, and the devs have gone dark unfortunately.
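For anyone wondering what "correct chunking and streaming" means in practice, a minimal sketch follows. It is not Fish's code or API: `chunk_text` and `stream_audio` are hypothetical helpers, `synthesize` stands in for whatever call turns one piece of text into audio, and the 200-character limit is an arbitrary assumption.

```python
import re

def chunk_text(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence, so a chunk can exceed the limit in that one case.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate        # sentence still fits in the current chunk
        else:
            chunks.append(current)     # close the chunk at a sentence boundary
            current = sentence
    if current:
        chunks.append(current)
    return chunks

def stream_audio(text, synthesize, max_chars=200):
    """Yield audio chunk by chunk so playback can start before the end."""
    for chunk in chunk_text(text, max_chars):
        yield synthesize(chunk)        # synthesize is a caller-supplied stand-in

# Usage with a dummy synthesizer that just upper-cases the text.
pieces = list(stream_audio("Hi there. Bye now.", lambda s: s.upper(), max_chars=10))
```

Splitting only at sentence boundaries keeps prosody intact within each chunk, and the generator lets the caller start playing the first chunk while later ones are still being synthesized.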
