Show HN: Dia, an open-weights TTS model for generating realistic dialogue

1. Havoc ◴[21 Apr 25 21:33 UTC] No.43756741[source]▶

Sounds really good & human! Got a fair number of unexpected artifacts though, e.g. 3 seconds of hissing noise before the dialogue, and music in the background when I added (happy) in an attempt to control tone. Also don't understand how to control the S1 and S2 speakers... is it just random based on temp?

> TODO Docker support

Got this adapted pretty easily. Just take the latest NVIDIA CUDA container, throw Python and the modules on it, and change the server to serve on 0.0.0.0. It does mean it pulls the model every time on startup though, which isn't ideal.

replies(3): >>43756851 #>>43757435 #>>43757925 #

2. yjftsjthsd-h ◴[21 Apr 25 21:45 UTC] No.43756851[source]▶

>>43756741 (TP) #

> Does mean it pulls the model every time on startup though which isn't ideal

Surely it just downloads to a directory that can be volume mapped?

replies(1): >>43756944 #

3. Havoc ◴[21 Apr 25 21:56 UTC] No.43756944[source]▶

>>43756851 #

Yep. I just didn't spend the time to track down the location tbh. Plus Hugging Face usually links to a cache folder whose location I don't recall.

Literally got CUDA containers working earlier today, so I haven't spent a huge amount of time figuring things out.

replies(1): >>43764861 #

4. toebee ◴[21 Apr 25 22:59 UTC] No.43757435[source]▶

>>43756741 (TP) #

Thank you for the kind words! Dia wasn’t fine-tuned on a specific speaker, so you will get random voices every time you run it, unless you add an audio prompt or fix the seed.

The outputs are a bit unstable; we might need to add cleaner training data and run longer training sessions. Hopefully we can do something like OpenAI did with Whisper and update with better-performing checkpoints!
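
For anyone wondering what "fix the seed" could look like in practice, here is a minimal sketch. The helper name `seed_everything` is hypothetical and not part of Dia; it assumes the model samples through the usual Python, NumPy, and PyTorch random number generators, and it simply skips whichever of those libraries is not installed. Seeding alone may not make GPU output bit-identical, so treat it as a starting point.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the common RNGs so repeated runs draw the same samples.

    Hypothetical helper: assumes sampling goes through Python, NumPy,
    or PyTorch generators. Missing libraries are skipped silently.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Seeding twice with the same value replays the same random draws.
seed_everything(1234)
first = [random.random() for _ in range(3)]
seed_everything(1234)
second = [random.random() for _ in range(3)]
```

Call it once before each generation run if you want the same voice back for the same text.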

5. dragonwriter ◴[22 Apr 25 00:26 UTC] No.43757925[source]▶

>>43756741 (TP) #

> Also don't understand how to control the S1 and S2 speakers...

Make a clip with the speakers you want as the audio prompt, add the transcript of that clip (with speaker tags) at the beginning of your text prompt, and it clones the voices from your audio prompt for the output.
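
A rough sketch of the text side of that recipe, in case it helps. The function names are made up for illustration, and the bracketed `[S1]` / `[S2]` tag format is an assumption about how the speaker tags are written; passing the audio clip itself to the model is left out because it depends on the Dia version you run.

```python
from typing import Iterable, Tuple

Turn = Tuple[str, str]  # (speaker tag such as "S1", spoken text)


def format_turns(turns: Iterable[Turn]) -> str:
    """Render turns as '[S1] text [S2] text ...' (assumed tag format)."""
    return " ".join(f"[{speaker}] {text.strip()}" for speaker, text in turns)


def build_cloning_prompt(clip_turns: Iterable[Turn],
                         new_turns: Iterable[Turn]) -> str:
    """Put the audio clip's transcript in front of the new dialogue."""
    parts = [format_turns(clip_turns), format_turns(new_turns)]
    return " ".join(part for part in parts if part)


# Transcript of the reference clip, then the lines to generate.
clip = [("S1", "Hi there."), ("S2", "Hello!")]
script = [("S1", "Did you try the new model?"), ("S2", "Not yet.")]
prompt = build_cloning_prompt(clip, script)
print(prompt)
```

The transcript has to match what is actually said in the clip, with the same speaker tags, or the voices may not line up with the right speakers.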

6. genewitch ◴[22 Apr 25 18:19 UTC] No.43764861{3}[source]▶

>>43756944 #

It's in a dot folder in your home dir on Linux and in %appdata% on Windows.
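
If you would rather check than guess, the sketch below resolves where the Hugging Face hub cache should end up. It assumes the model is fetched through `huggingface_hub` and follows that library's documented lookup order as I understand it: `HF_HUB_CACHE`, then `HF_HOME` plus `hub`, then `XDG_CACHE_HOME`, and finally `~/.cache/huggingface/hub` under the home directory. The function name `resolve_hf_hub_cache` is hypothetical, and the actual location on your platform is worth verifying.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hf_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Best-guess location of the Hugging Face hub cache.

    Hypothetical helper mirroring the documented environment variables;
    pass a mapping to inspect a configuration other than os.environ.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    if env.get("XDG_CACHE_HOME"):
        return Path(env["XDG_CACHE_HOME"]).expanduser() / "huggingface" / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


# Print the directory that would need to be volume mapped.
print(resolve_hf_hub_cache())
```

Setting `HF_HOME` to a path that is mounted into the container (for example with `docker run -v`) should let restarts reuse the downloaded weights instead of pulling them again.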