(microsoft.github.io)

448 points lastdong | 1 comments | 03 Sep 25 10:44 UTC

viggity ◴[03 Sep 25 12:30 UTC] No.45114924[source]▶

I feel like this is a step in the right direction, but a lot of emotive text-to-speech models only change the duration and loudness of each word; the timing/pauses are better too.

I would love to have a model that can make sense of things like stressing particular syllables or phonemes to make a point.

replies(1): >>45116568 #

1. watsonmusic ◴[03 Sep 25 14:56 UTC] No.45116568[source]▶

>>45114924 #

this model is superb

VibeVoice: A Frontier Open-Source Text-to-Speech Model