(microsoft.github.io)

448 points lastdong | 2 comments | 03 Sep 25 10:44 UTC

1. viggity [03 Sep 25 12:30 UTC] No.45114924

I feel like this is a step in the right direction, but a lot of emotive text-to-speech models only change the duration and loudness of each word. The timing and pauses are better too.

I would love to have a model that can make sense of things like stressing particular syllables or phonemes to make a point.

replies(1): >>45116568

2. watsonmusic [03 Sep 25 14:56 UTC] No.45116568

>>45114924

this model is superb

VibeVoice: A Frontier Open-Source Text-to-Speech Model