
652 points by toebee | 9 comments
toebee No.43754125
Hey HN! We’re Toby and Jay, creators of Dia. Dia is a 1.6B-parameter open-weights model that generates dialogue directly from a transcript.

Unlike TTS models that generate each speaker turn separately and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.

It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.

Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia

We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast feel with APIs, but the result did not sound like a human conversation.

So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch, from large-scale training to audio tokenization. It took us a bit over 3 months.

Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.

We’d love to hear what you think! We are a tiny team, so open-source contributions are extra welcome. Please feel free to check out the code and share any thoughts or suggestions with us.

replies(11): >>43754718 >>43754758 >>43755567 >>43756264 >>43756302 >>43757244 >>43757317 >>43757653 >>43758343 >>43758672 >>43768981
1. dangoodmanUT No.43757653
I know it’s taboo to ask, but I must: where’s the dataset from? I’m very eager to play around with audio models myself, but I find existing datasets limiting.
replies(2): >>43758141 >>43765198
2. zelphirkalt No.43758141
Why would that be a taboo question to ask? It should be the question we always ask when presented with a model, and in some cases we should probably reject the model based on that information.
replies(1): >>43758318
3. dangoodmanUT No.43758318
Because generally the person asking this question is trying to cancel the model maker.
replies(3): >>43758511 >>43760208 >>43773535
4. tough No.43758511
Or, by replying, you expose yourself to handing proof of the training data’s origins to the copyright owner who wants to sue you next.
5. deng No.43760208
No. It's for giving credit where credit is due. And yes, that includes the question of whether the people who generated the training data in the first place have consented to its use for AI training.

It's quite concerning that the community around here is usually livid about FOSS license violations, which typically use copyright law as leverage, but somehow is perfectly OK with training models on copyrighted work and just labels that as "fair use".

replies(1): >>43784025
6. xdfgh1112 No.43765198
I suspect podcasts, as you have a huge amount of transcribed data with good diction and mic quality. The voices sound like podcast voices to me.
7. fennecfoxy No.43773535
Well, presumably since they're individuals and not a business, the legal consequences are much less severe. Public opinion still won't be great, but when has it ever been, for any new thing?

If I cut up a song or TV show and put it on YouTube (and screech about fair use/parody law), then that's fine, but people will balk at something like this.

AI is here, people.

8. isaacfung No.43784025
What AI tools have you used recently? Have you verified if they all use models trained on copyrighted material with permission?
replies(1): >>43796728
9. deng No.43796728
Ah, that's a classic. "How can you criticize Big Oil and at the same time drive a car!" and voila, the case is closed.

I am allowed to criticize things without having to live like a hermit. I make moderate use of ChatGPT, yet at the same time I think that its training does not fall under fair use, and that creators should get compensated. If OpenAI's business model does not allow for this, then it should fail, and that's fine by me. I lived without ChatGPT, and I can live without it again.