VibeVoice: A Frontier Open-Source Text-to-Speech Model

1. baal80spam ◴[03 Sep 25 11:49 UTC] No.45114646[source]▶

Wow. I admit that I am not a native speaker, but this looks (or rather, sounds) VERY impressive and I could mistake it for hearing two people talking.

replies(2): >>45114849 #>>45114891 #

2. x187463 ◴[03 Sep 25 12:19 UTC] No.45114849[source]▶

>>45114646 (TP) #

The giveaway is they will never talk over each other. Only one speaker at a time, consistently.

replies(3): >>45114901 #>>45114974 #>>45115896 #

3. tracker1 ◴[03 Sep 25 12:24 UTC] No.45114891[source]▶

>>45114646 (TP) #

Yeah, a lot of TTS has gotten really impressive in general. Definitely a clear leap from the TTS stuff I worked with for training simulations a bit over a decade ago. Aside: installing a sound card on a Windows server just to be able to generate TTS was interesting. The platform required one, even though the card itself was never used.

I generally don't like a lot of the AI-generated slop that's starting to pop up on YouTube these days... I did enjoy some of the Reddit story channels, but have completely stopped with it all now. With the AI stuff, it really becomes apparent with dates/ages and when numbers are spoken. Dates/ages/timelines are just off as far as story generation goes, and really should be tweaked by a human. As for the voice gen, the way it says a year or a measurement is just not how English speakers (US or otherwise) say them.

4. tracker1 ◴[03 Sep 25 12:26 UTC] No.45114901[source]▶

>>45114849 #

Fair enough... though it would be possible to generate the audio with stuttering/pauses introduced at the beginning and end of statements, then edit the output so the speech overlaps.

You'd probably want to do something similar for balance/crossfade anyway... having each speaker's audio offset from center instead of straight mono.
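A rough sketch of that overlay-and-pan idea, not anything VibeVoice itself provides: it assumes each turn is already available as a separate mono clip (a list of float samples), and the names `pan_gains`, `mix_turns`, `overlap_s`, and `pans` are made up for illustration. Each turn starts slightly before the previous one ends, and each speaker is placed a little off-center using constant-power panning.

```python
import math

def pan_gains(pan):
    """Constant-power (left, right) gains for pan in [-1.0, 1.0]; 0.0 is center."""
    theta = (pan + 1.0) * math.pi / 4.0
    return math.cos(theta), math.sin(theta)

def mix_turns(turns, sample_rate, overlap_s=0.2, pans=(-0.3, 0.3)):
    """Overlay mono turns into a stereo pair of sample lists.

    turns: list of (speaker_index, samples), samples being floats in [-1, 1].
    Every turn after the first starts overlap_s seconds before the previous
    turn ends, so the speakers briefly talk over each other.
    Note: overlapping samples are summed without limiting, so a real
    pipeline would still need gain control to avoid clipping.
    """
    overlap = round(overlap_s * sample_rate)
    left, right = [], []
    cursor = 0  # index at which the next turn starts
    for speaker, samples in turns:
        gain_l, gain_r = pan_gains(pans[speaker])
        end = cursor + len(samples)
        if end > len(left):
            # Grow both channels with silence so the turn fits.
            pad = end - len(left)
            left.extend([0.0] * pad)
            right.extend([0.0] * pad)
        for i, s in enumerate(samples):
            left[cursor + i] += s * gain_l
            right[cursor + i] += s * gain_r
        # Pull the next start back by the overlap, but never before this turn's start.
        cursor = max(end - overlap, cursor)
    return left, right
```

With a 10 Hz toy sample rate, two five-sample turns, and the default 0.2 s overlap, the second turn starts two samples before the first ends, and the two voices sit on opposite sides of center. Real clips would come from the TTS output and be written out with an audio library of your choice.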

5. kaptainscarlet ◴[03 Sep 25 12:35 UTC] No.45114974[source]▶

>>45114849 #

Also, the lack of stutter and the perfect flow of speech are a dead giveaway.

6. kridsdale1 ◴[03 Sep 25 13:57 UTC] No.45115896[source]▶

>>45114849 #

And a longer pause between turns than humans would leave.