
VibeVoice: A Frontier Open-Source Text-to-Speech Model

(microsoft.github.io)

1. simiones ◴[03 Sep 25 12:23 UTC] No.45114884[source]▶

>>45114245 (OP) #

I read the comments praising these voices as very life like, and went to the page primed to hear very convincing voices. That is not at all what I heard though.

The voices are decent, but the intonation is off on almost every phrase, and there is a very clear robotic-sounding modulation. It's generally very impressive compared to many text-to-speech solutions from a few years ago, but for today, I find it very uninspiring. The AI generated voice you hear all over YouTube shorts is at least as good as most of the samples on this page.

The only part that seemed impressive to me was the English + (Mandarin?) Chinese sample, that one seemed to switch very seamlessly between the two. But this may well be simply because (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that, and (2) the different character systems make it extremely clear that the model needs to switch between different languages. Peut-être que cela n'aurait pas été si simple if it had been switching between two languages using the same writing system - I'm particularly curious how it would have read "simple" in the phrase above (I think it should be read with the French pronunciation, for example).

And, of course, the singing part is painfully bad, I am very curious why they even included it.

replies(11): >>45114973 #>>45115076 #>>45115109 #>>45115714 #>>45115907 #>>45116238 #>>45116262 #>>45116513 #>>45117982 #>>45119535 #>>45122185 #

2. rcarmo ◴[03 Sep 25 12:35 UTC] No.45114973[source]▶

>>45114884 (TP) #

One of the things this model is actually quite good at is voice cloning. Drop a recorded sample of your voice into the voices folder, and it just works.

replies(1): >>45116655 #

3. MengerSponge ◴[03 Sep 25 12:47 UTC] No.45115076[source]▶

>>45114884 (TP) #

> (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that

https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect

4. IshKebab ◴[03 Sep 25 12:52 UTC] No.45115109[source]▶

>>45114884 (TP) #

I agree. For some reason the female voices are waaay more convincing than the male ones too, which sound barely better than speech synthesis from a decade ago.

replies(1): >>45116546 #

5. echelon ◴[03 Sep 25 13:44 UTC] No.45115714[source]▶

>>45114884 (TP) #

This is close to SOTA emotional performance, at least the female voices.

I trust the human scores in the paper. At least my ear aligns with that figure.

With stuff like this coming out in the open, I wonder if ElevenLabs will maintain its huge ARR lead in the field. I really don't see how they can continue to maintain a lead when their offering is getting trounced by open models.

replies(2): >>45116221 #>>45116592 #

6. mclau157 ◴[03 Sep 25 13:59 UTC] No.45115907[source]▶

>>45114884 (TP) #

ElevenLabs has a much more convincing voice model

replies(3): >>45116308 #>>45116309 #>>45116659 #

7. kamranjon ◴[03 Sep 25 14:26 UTC] No.45116221[source]▶

>>45115714 #

Hmmmm… what is your opinion on the examples showcased here vs the ones on the Dia demo page?

https://yummy-fir-7a4.notion.site/dia

I am not sure why but I find the pacing of the parakeet based models (like Dia) to be much more realistic.

8. jstummbillig ◴[03 Sep 25 14:28 UTC] No.45116238[source]▶

>>45114884 (TP) #

Is there any better model you can point at? I would be interested in having a listen.

There are people – and it does not matter what it's about – that will overstate the progress made (and others will understate it, case in point). Neither should put a damper on progress. This is the best I personally have heard so far, but I certainly might have missed something.

replies(7): >>45116434 #>>45116663 #>>45117083 #>>45117999 #>>45120502 #>>45122997 #>>45124570 #

9. odie5533 ◴[03 Sep 25 14:29 UTC] No.45116262[source]▶

>>45114884 (TP) #

It's good but not the best free model. I find Chatterbox to be more realistic, with no robotic sound and better (though not perfect) intonation.

replies(2): >>45118687 #>>45120756 #

10. DrBenCarson ◴[03 Sep 25 14:33 UTC] No.45116308[source]▶

>>45115907 #

Open source?

11. sys32768 ◴[03 Sep 25 14:33 UTC] No.45116309[source]▶

>>45115907 #

They also offer an AI Voice Changer that will take a recording and transform it into a different voice but retain the cadence and intonation.

12. Uehreka ◴[03 Sep 25 14:45 UTC] No.45116434[source]▶

>>45116238 #

It’s tough to name the best local TTS since they all seem to trade off on quality and features and none of them are as good as ElevenLabs’ closed-source offering.

However Kokoro-82M is an absolute triumph in the small model space. It curbstomps models 10-20x its size in terms of quality while also being runnable on like, a Raspberry Pi. It’s the kind of thing I’m surprised even exists. Its downside is that it isn’t super expressive, but the af_heart voice is extremely clean, and Kokoro is way more reliable than other TTS models: It doesn’t have the common failure mode where you occasionally have a couple extra syllables thrown in because you picked a bad seed.

If you want something that can do convincing voice acting, either pay for ElevenLabs or keep waiting. If you’re trying to build a local AI assistant, Kokoro is perfect, just use that and check the space again in like 6 months to see if something’s beaten it. https://huggingface.co/hexgrad/Kokoro-82M

replies(2): >>45117058 #>>45117949 #

13. Uehreka ◴[03 Sep 25 14:52 UTC] No.45116513[source]▶

>>45114884 (TP) #

Their comments about the singing and background music are odd. It’s been a while since I’ve done academic research, but something about those comments gave me a strong “we couldn’t figure out how to make background music go away in time for our paper submission, so we’re calling it a feature” vibe as opposed to a “we genuinely like this and think it’s a differentiator” vibe.

replies(1): >>45119731 #

14. selkin ◴[03 Sep 25 14:54 UTC] No.45116546[source]▶

>>45115109 #

Results correlate with investment, and there’s more of it in synthesizing female-coded voices. As for why female-coded voices get more investment, we all know; the only difference is in attitude towards that (the correct answer, of course, is “it sucks”).

replies(1): >>45117249 #

15. watsonmusic ◴[03 Sep 25 14:58 UTC] No.45116592[source]▶

>>45115714 #

11labs is facing a real competitor

16. watsonmusic ◴[03 Sep 25 15:04 UTC] No.45116655[source]▶

>>45114973 #

bonus usage

17. watsonmusic ◴[03 Sep 25 15:04 UTC] No.45116659[source]▶

>>45115907 #

it's not oss

18. lynx97 ◴[03 Sep 25 15:04 UTC] No.45116663[source]▶

>>45116238 #

I cobbled together llm-tts to run as many local (and remote) TTS models as I could find and get working.

https://github.com/mlang/llm-tts

Strictly speaking, even music generation fits the usage pattern: text in, audio out.

llm-tts is far from complete, but it makes it relatively "easy" to try a few models in a uniform way.

19. refulgentis ◴[03 Sep 25 15:38 UTC] No.45117058{3}[source]▶

>>45116434 #

There's a certain know-nothing feeling I get that makes me worried if we start at the link (which has data showing it > ElevenLabs quality), jump to "eh, it's actually worse than anything I've heard in the last 2 years", and end up at "none are as good as ElevenLabs" - the recommendation and commentary on it, of course, has nothing to do with my feeling, cheers

20. nipponese ◴[03 Sep 25 15:41 UTC] No.45117083[source]▶

>>45116238 #

Not OS or local, but just try ChatGPT Voice Conversation mode. To my ears, it's a generation ahead of these VibeVoice samples.

21. recursive ◴[03 Sep 25 15:56 UTC] No.45117249{3}[source]▶

>>45116546 #

We all know? Female voices have better intelligibility? That's my guess anyway.

replies(2): >>45117432 #>>45117761 #

22. selkin ◴[03 Sep 25 16:11 UTC] No.45117432{4}[source]▶

>>45117249 #

If you don't know, it's on you to learn. If you do know and prefer to make an asshole of yourself, that's also on you.

23. kadoban ◴[03 Sep 25 16:34 UTC] No.45117761{4}[source]▶

>>45117249 #

There's a lot of money and effort spent in satisfying the sexual desires of (predominantly straight) men. There's not typically quite as much interest in doing the same for women.

For example I've been looking at models and loras for generating images, and the boards are _full_ of ones that will generate women well or in some particular style. Quite often at least a couple of the preview images for each are hidden behind a button because they contain nudity. Clearly the intent is that they are at least able to generate porn containing women. There's a small handful that are focused on men and they're very aware of it, they all have notes lampshading how oddball they are to even exist.

I would expect that this is not as pronounced an effect in the world of speech generation, but it must still exist.

replies(1): >>45119407 #

24. sandreas ◴[03 Sep 25 16:48 UTC] No.45117949{3}[source]▶

>>45116434 #

What is your opinion about F5-TTS or Fish-TTS?

replies(1): >>45123530 #

25. skripp ◴[03 Sep 25 16:51 UTC] No.45117982[source]▶

>>45114884 (TP) #

The male Chinese speakers had THICK American accents. Nothing really wrong with the language, but think of the stereotypical German speaking English. That was kind of strange to me.

replies(1): >>45118004 #

26. riquito ◴[03 Sep 25 16:53 UTC] No.45117999[source]▶

>>45116238 #

Probably not even the best ones, but among some recent models I find Dia and Orpheus more natural

- http://dia-tts.com/

- https://github.com/canopyai/Orpheus-TTS

27. ascorbic ◴[03 Sep 25 16:53 UTC] No.45118004[source]▶

>>45117982 #

I think it's because it was using the American voice for it. Conversely the female voice in the Mandarin conversation spoke English with a Chinese accent.

28. eaglehead ◴[03 Sep 25 17:58 UTC] No.45118687[source]▶

>>45116262 #

I agree. We switched from elevenlabs to chatterbox (hosted on Resemble.ai) and it is much much cheaper and better.

29. lacy_tinpot ◴[03 Sep 25 19:09 UTC] No.45119407{5}[source]▶

>>45117761 #

I think this is a very lazy kind of cultural analysis. The reason female voices are being chosen over male ones is a little more multifaceted than just SEX. Heterosexual women also tend to prefer female voices over male ones.

Female voices are often rated as being clearer, easier to understand, "warmer", etc.

Why this is the case is still an open question, but it's definitely more complex than just SEX.

replies(2): >>45120715 #>>45122634 #

30. johanyc ◴[03 Sep 25 19:24 UTC] No.45119535[source]▶

>>45114884 (TP) #

The Chinese is good. In the Mandarin-to-English example she sounds native. The English-to-Mandarin one sounds good too, but he does have an English speaker's accent, which I think is intentional.

31. phildougherty ◴[03 Sep 25 19:47 UTC] No.45119731[source]▶

>>45116513 #

Totally felt the same way! Singing happens spontaneously? What?

replies(1): >>45120529 #

32. whimsicalism ◴[03 Sep 25 21:17 UTC] No.45120502[source]▶

>>45116238 #

i think orpheus and sesame sound better

33. lyu07282 ◴[03 Sep 25 21:19 UTC] No.45120529{3}[source]▶

>>45119731 #

They mention that in the FAQ here: https://github.com/microsoft/VibeVoice/tree/main?tab=readme-...

> In fact, we intentionally decided not to denoise our training data because we think it's an interesting feature for BGM to show up at just the right moment. You can think of it as a little easter egg we left for you.

It's not a bug, it's a feature! Okaaaaay

34. selkin ◴[03 Sep 25 21:44 UTC] No.45120715{6}[source]▶

>>45119407 #

That you consider it sex (rather than gender), is exactly why there’s a preference for female coded voices. Consider where we do hear male recorded voices used as default.

replies(3): >>45120820 #>>45123747 #>>45124345 #

35. lastdong ◴[03 Sep 25 21:50 UTC] No.45120756[source]▶

>>45116262 #

Chatterbox sounds great, their demo page is a good introduction: https://resemble-ai.github.io/chatterbox_demopage/

36. recursive ◴[03 Sep 25 21:57 UTC] No.45120820{7}[source]▶

>>45120715 #

Overloaded term. It was a reference to the parent's reference.

> satisfying the sexual desires of

So, "sex" as a reference to "sexual desires". In English, it just so happens that "sex" has other meanings, but those weren't in play at the time.

37. iansinnott ◴[04 Sep 25 00:53 UTC] No.45122185[source]▶

>>45114884 (TP) #

The English/Mandarin section was VERY impressive. The accents of both the woman speaking English and the man speaking Chinese were spot on. Both sound very convincingly like they are speaking a second language, which anyone here can hear in the voice of the Chinese woman speaking English. I'd like to add that the foreigner speaking Chinese was also spot on.

38. kadoban ◴[04 Sep 25 01:58 UTC] No.45122634{6}[source]▶

>>45119407 #

I don't think that this is the only factor, I just suspect that it is _a_ factor.

replies(1): >>45132575 #

39. satellite2 ◴[04 Sep 25 02:54 UTC] No.45122997[source]▶

>>45116238 #

Elevenlabs v3 (not local)

40. brettpro ◴[04 Sep 25 04:17 UTC] No.45123530{4}[source]▶

>>45117949 #

I recently implemented Fish for a project and found it adequate for TTS but wildly impressive in voice cloning. My POC originally required 3-10 audio samples but I removed the minimum because it could usually one shot it.

The model is good, but I will say their inference code leaves a lot to be desired. I had to rewrite large portions of it for simple things like correct chunking and streaming. The advertised expressive keywords are very much hit and miss, and the devs have gone dark unfortunately.

41. pylotlight ◴[04 Sep 25 05:00 UTC] No.45123747{7}[source]▶

>>45120715 #

woosh

42. akimbostrawman ◴[04 Sep 25 06:54 UTC] No.45124345{7}[source]▶

>>45120715 #

How the hell would you determine someone's self-assigned social gender based on their voice, which is a result of their physical sex?

43. popalchemist ◴[04 Sep 25 07:26 UTC] No.45124570[source]▶

>>45116238 #

Higgs Audio v2 is currently SOTA in OSS TTS.

44. lacy_tinpot ◴[04 Sep 25 21:44 UTC] No.45132575{7}[source]▶

>>45122634 #

>There's not typically quite as much interest in doing the same for women.

Women also prefer female voices.

replies(1): >>45162390 #

45. kadoban ◴[07 Sep 25 21:34 UTC] No.45162390{8}[source]▶

>>45132575 #

Okay. I'd happily believe that, it doesn't contradict what I said.

The quote you have from me is from this context:

> There's a lot of money and effort spent in satisfying the sexual desires of (predominantly straight) men. There's not typically quite as much interest in doing the same for women.

In that context, your response is impossible to respond to. Do you even disagree with what I said or do you (like me) just think that there are other factors in addition?

Any particular reason you're being kind of a dick btw?
