(play.ht)

258 points amrrs | 1 comments | 14 Oct 24 19:16 UTC

Mizza ◴[14 Oct 24 21:20 UTC] No.41842147[source]▶

What's SOTA for open source or on-device right now?

I tried building a babelfish with o1, but the transcription in languages other than English is useless. When it gets it right, the translations are pretty much perfect and the voice responses are super fast, but without good transcription the whole thing doesn't work. So close!

replies(5): >>41842153 #>>41842200 #>>41842281 #>>41843179 #>>41846783 #

diggan ◴[14 Oct 24 21:34 UTC] No.41842281[source]▶

>>41842147 #

I was literally just looking at that today, and the best one I came across was F5-TTS: https://swivid.github.io/F5-TTS/

Only thing missing (for me) is "emotion tokens" instead of forcing the entire generation to be with a specific emotion, as the generated voice is a bit too robotic otherwise.

replies(1): >>41842581 #

moffkalast ◴[14 Oct 24 22:08 UTC] No.41842581[source]▶

>>41842281 #

> based on flow matching with Diffusion Transformer

Yeah that's not gonna be realtime. It's really odd that we currently have two options: VITS/Piper, which runs at ludicrous speed on a CPU and is kinda OK, and these slightly more natural models à la StyleTTS2 that take 2 minutes to generate a sentence with CUDA acceleration.

Like, is there a middle ground? Maybe inverting one of the smaller whispers or something.

replies(2): >>41842748 #>>41842974 #

1. gunalx ◴[14 Oct 24 22:45 UTC] No.41842974[source]▶

>>41842581 #

Bark?

Play 3.0 mini – A lightweight, reliable, cost-efficient Multilingual TTS model