(microsoft.github.io)

TheAceOfHearts ◴[03 Sep 25 13:41 UTC] No.45115690[source]▶

Unfortunately it's not usable if you're GPU-poor. I couldn't figure out how to run this with an old 1080. I tried VibeVoice-1.5B on my old CPU with torch.float32 and it took 832 seconds to generate a 66-second audio clip. Switching from torch.bfloat16 also introduced some weird sound artifacts in the audio output. If you're GPU-poor, the best TTS model I've tried so far is Kokoro.
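To put those numbers in perspective, here is a small sketch of the real-time factor (generation time divided by audio length) using the figures quoted above. The function name `real_time_factor` is made up for illustration and is not part of VibeVoice or PyTorch.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return how many seconds of compute each second of audio costs.

    Values above 1.0 mean generation is slower than playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Figures from the comment: 832 s of CPU time for a 66 s clip.
rtf = real_time_factor(832, 66)
print(f"{rtf:.1f}x slower than real time")
```

At roughly 12.6 seconds of compute per second of audio, a one-hour recording would take about half a day on the same CPU.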

Someone else mentioned in this thread that you cannot add annotations to the text to control the output. I think for these models to really level up there will have to be an intermediate step that takes your regular text as input and generates annotated output, which can then be passed to the TTS model. That would give users way more control over the final output, since they would be able to inspect and tweak any details instead of expecting the model to get everything right in a single pass.
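A rough sketch of what that two-stage flow could look like. Everything here is hypothetical: the `Segment` type, the `annotate` and `render` functions, and the `<pause>`/`<emph>` markup are invented for illustration, and no existing TTS model (VibeVoice included) is known to accept this format. The annotator is a trivial rule-based stand-in for what would presumably be a model.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    """One sentence plus hypothetical prosody annotations."""
    text: str
    pause_ms: int = 0
    emphasis: bool = False


def annotate(text: str) -> list[Segment]:
    """Stage 1 (stand-in for a model): split into sentences, guess annotations."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        Segment(text=s, pause_ms=300, emphasis=s.endswith("!"))
        for s in sentences
    ]


def render(segments: list[Segment]) -> str:
    """Stage 2 input: serialize segments into a made-up inline markup."""
    parts = []
    for seg in segments:
        body = f"<emph>{seg.text}</emph>" if seg.emphasis else seg.text
        parts.append(f"{body}<pause ms={seg.pause_ms}>")
    return " ".join(parts)


segments = annotate("It works. It really works!")
# The user can inspect and tweak any detail before synthesis,
# for example lengthening the pause after the first sentence.
segments[0] = replace(segments[0], pause_ms=600)
markup = render(segments)
print(markup)
```

The point of the intermediate representation is that the edit happens on plain data between the two stages, so a bad guess by the annotator can be corrected without re-running anything upstream.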

replies(1): >>45115995 #

1. tempodox ◴[03 Sep 25 14:07 UTC] No.45115995[source]▶

>>45115690 #

This is ludicrous. macOS has had text-to-speech for ages with acceptable quality, and they never needed energy- and compute-expensive models for it. And it reacts instantly, not after ridiculous delays. I cannot believe this hype about “AI”, it’s just too absurd.

replies(1): >>45116204 #

2. NitpickLawyer ◴[03 Sep 25 14:25 UTC] No.45116204[source]▶

>>45115995 (TP) #

> with acceptable quality

Compared to IBM's Stephen Hawking chair, maybe. But Apple TTS is not acceptable quality in any modern understanding of SotA, IMO.

replies(1): >>45116623 #

3. selkin ◴[03 Sep 25 15:01 UTC] No.45116623[source]▶

>>45116204 #

Different use cases:

If you need a non-visual output of text, SotA is a waste of electrons.

If you want to try and mimic a human speaker, then it ain’t.

Question is why would you need to have the computer sound more human, except for “because I can”.

replies(3): >>45116733 #>>45117806 #>>45119308 #

4. NitpickLawyer ◴[03 Sep 25 15:10 UTC] No.45116733{3}[source]▶

>>45116623 #

I tried listening to audiobooks generated with TTS. It takes me out of it most of the time, and I lose focus. That podcast thing from Google was the first time I felt like I could listen to an entire thing without feeling the uncanny valley thing. And I knew it was genAI. So I'm looking for that, but for my content. Grab a bunch of articles (long form, deeply researched) and "podcast" them but with natural voices, sans hype. Or books. Have them ready when I'm out and about.

replies(1): >>45117831 #

5. Ukv ◴[03 Sep 25 16:37 UTC] No.45117806{3}[source]▶

>>45116623 #

> Question is why would you need to have the computer sound more human

I think translation would be a big use - maybe translating your voice to another language while maintaining emotion and intonation, or dubbing content (videos, movies, podcasts, ...) that isn't otherwise available in your native language.

Traditional non-ML TTS for longer content like podcasts or audiobooks seems like it'd become grating to the point of being unlistenable, or at least a significantly worse experience. It stands to benefit from more natural-sounding voices that can place emphasis in the right places.

Since Stephen Hawking was brought up, there are likely also people with voice-impairing illnesses who would like to speak in their own voice again (in addition to those who are fine with a robotic voice). Or alternatively, people who are uncomfortable with their natural voice and want to communicate closer to how they wish to be perceived.

Could also potentially be used for new forms of interactive media that aren't currently feasible - customised movies, audio dramas where the listener plays a role, videogame NPCs that react with more than just prerecorded lines, etc.

6. andrew_lettuce ◴[03 Sep 25 16:39 UTC] No.45117831{4}[source]▶

>>45116733 #

The Google podcasts are so cringey positive it emotionally pains me. Nobody finds pineapple on pizza that amazing.

replies(1): >>45118054 #

7. lagniappe ◴[03 Sep 25 16:58 UTC] No.45118054{5}[source]▶

>>45117831 #

>Nobody finds pineapple on pizza that amazing

We can't be friends

8. crazygringo ◴[03 Sep 25 19:01 UTC] No.45119308{3}[source]▶

>>45116623 #

Audiobooks and other material you want to listen to (articles, blog posts, etc.).

There's a lot of stuff I don't have time to sit down and read, but want to listen to while I cook/do laundry/shower/drive/etc.

Often recordings don't exist. Or when they do, an audiobook just has a bad voiceover artist, or one that just rubs you the wrong way.

The more human text-to-speech sounds, the easier and less distracting it is to listen to. There's real value in it; it's not "because I can".

You know how it's nicer to read in 300 dpi instead of 72 dpi? Or in Garamond rather than Courier? Or in Helvetica rather than Comic Sans? It's like that, only for speech.

VibeVoice: A Frontier Open-Source Text-to-Speech Model