←back to thread

652 points toebee | 1 comments | | HN request time: 0.264s | source
Show context
toebee ◴[] No.43754125[source]
Hey HN! We’re Toby and Jay, creators of Dia. Dia is 1.6B parameter open-weights model that generates dialogue directly from a transcript.

Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.

It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.

Demo page comparing it to ElevenLabs and Sesame-1B https://yummy-fir-7a4.notion.site/dia

We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast-feel with APIs but it did not sound like human conversations.

So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch — from large-scale training, to audio tokenization. It took us a bit over 3 months.

Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.

We’d love to hear what you think! We are a tiny team, so open source contributions are extra-welcomed. Please feel free to check out the code, and share any thoughts or suggestions with us.

replies(11): >>43754718 #>>43754758 #>>43755567 #>>43756264 #>>43756302 #>>43757244 #>>43757317 #>>43757653 #>>43758343 #>>43758672 #>>43768981 #
cchance ◴[] No.43758672[source]
Its really amazing cant wait to play with it some, the samples are great... but oddly all seem... really fast, like they'd be perfect but they feel like they're playing at 1.2x speed or is that just me?
replies(1): >>43760433 #
1. claiir ◴[] No.43760433[source]
It’s not just you. The speedup is an artefact of the CFG (Classifier-Free Guidance) the model uses. The other problem is the speedup isn’t constant—it actually accelerates as the generation progresses. The Parakeet paper [1] (which OP lifted their model architecture almost directly from [2]) gives a fairly robust treatment to the matter:

> When we apply CFG to Parakeet sampling, quality is significantly improved. However, on inspecting generations, there tends to be a dramatic speed-up over the duration of the sample (i.e. the rate of speaking increases significantly over time). Our intuition for this problem is as follows: Say that is our model is (at some level) predicting phonemes and the ground truth distribution for the next phoneme occuring is 25% at a given timestep. Our conditional model may predict 20%, but because our uncondtional model cannot see the text transcription, its prediction for the correct next phoneme will be much lower, say 5%. With a reasonable level of CFG, because [the logit delta] will be large for the correct next phoneme, we’ll obtain a much higher final probability, say 50%, which biases our generation towards faster speech. [emphasis mine]

Parakeet details a solution to this, though this was not adopted (yet?) by Dia:

> To address this, we introduce CFG-filter, a modification to CFG that mitigates the speed drift. The idea is to first apply the CFG calculation to obtain a new set of logits as before, but rather than use these logits to sample, we use these logits to obtain a top-k mask to apply to our original conditional logits. Intuitively, this serves to constrict the space of possible “phonemes” to text-aligned phonemes without heavily biasing the relative probabilities of these phonemes (or for example, start next word vs pause more). [emphasis mine]

The paper contains audio samples with ablations you can listen to.

[1]: https://jordandarefsky.com/blog/2024/parakeet/#classifier-fr...

[2]: https://news.ycombinator.com/item?id=43758686