Play 3.0 mini – A lightweight, reliable, cost-efficient Multilingual TTS model

1. phkahler ◴[14 Oct 24 20:10 UTC] No.41841397[source]▶

Sounds quite good, but this prompt is NOT what I'd expect an automated system to feed into it:

“I’ve successfully processed your order and I’d like to confirm your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo, 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray.”

Phone numbers and others were read nicely, but apparently a string of alphanumerics for an order number isn't handled well yet.

replies(3): >>41841433 #>>41841899 #>>41842302 #

2. amrrs ◴[14 Oct 24 20:13 UTC] No.41841433[source]▶

>>41841397 (TP) #

Sorry, do you mean the audio for this text is not good?

“I’ve successfully processed your order and I’d like to confirm your product ID. It is A as in Alpha, 1, 2, 3, B as in Bravo, 5, 6, 7, Z as in Zulu, 8, 9, 0, X as in X-ray.”

I thought this was included in the demo, it seemed okay!

replies(2): >>41843549 #>>41848002 #

3. BoorishBears ◴[14 Oct 24 20:56 UTC] No.41841899[source]▶

>>41841397 (TP) #

Most of these prompts come from LLMs, so it's trivial to instruct them to provide a string that's broken out like that.

Also not the end of the world to process stuff like this with a regex.

Most of these newer TTS models require this type of formatting to reliably state long strings of numbers and IDs.
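
A minimal sketch of the regex approach mentioned above, assuming order IDs are uppercase tokens of at least six characters that mix letters and digits. The names `spell_out_id`, `expand_ids`, and `ID_PATTERN`, and the length threshold, are hypothetical choices for illustration and not part of any TTS vendor's API. The spelling words follow the thread's example ("Alpha", "X-ray").

```python
import re
import string

# Spelling alphabet words, using the spellings from the example prompt.
_WORDS = (
    "Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliett Kilo "
    "Lima Mike November Oscar Papa Quebec Romeo Sierra Tango Uniform Victor "
    "Whiskey X-ray Yankee Zulu"
).split()
SPELLING = dict(zip(string.ascii_uppercase, _WORDS))

# Assumption: an ID is an uppercase alphanumeric token, 6+ characters long,
# containing at least one letter and at least one digit.
ID_PATTERN = re.compile(r"\b(?=[A-Z0-9]*[A-Z])(?=[A-Z0-9]*\d)[A-Z0-9]{6,}\b")


def spell_out_id(order_id: str) -> str:
    """Turn 'A1B' into 'A as in Alpha, 1, B as in Bravo'."""
    parts = []
    for ch in order_id.upper():
        if ch.isdigit():
            parts.append(ch)
        elif ch in SPELLING:
            parts.append(f"{ch} as in {SPELLING[ch]}")
    return ", ".join(parts)


def expand_ids(text: str) -> str:
    """Replace every ID-like token in text with its spelled-out form."""
    return ID_PATTERN.sub(lambda m: spell_out_id(m.group(0)), text)


if __name__ == "__main__":
    print(expand_ids("Your product ID is A123B567Z890X."))
```

Because the pattern requires both a letter and a digit, ordinary words and plain numbers pass through untouched; a real system would likely need a stricter, format-specific pattern.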

4. diggan ◴[14 Oct 24 21:38 UTC] No.41842302[source]▶

>>41841397 (TP) #

> Phone numbers and others were read nicely

The phone numbers were not read naturally at all. A human would have read a grouping of 123-456-789 like "123", "456", "789", but instead the model generated something like "123", "45", "6789". Listen to the RSVP example again and you'll know what I mean. The pacing is generally off for normal text too, but extra noticeable for the numbers.

My hunch would be that it's because of tokenization, but I wouldn't be able to say that's the issue for sure. Sounds like it though :)

replies(1): >>41851695 #

5. mrkstu ◴[15 Oct 24 00:01 UTC] No.41843549[source]▶

>>41841433 #

'Alpha' is kind of swallowed and Bravo is mispronounced.

6. phkahler ◴[15 Oct 24 12:47 UTC] No.41848002[source]▶

>>41841433 #

>> Sorry, do you mean the audio for this text is not good?

No, the audio was OK or even good. The example seems to be an automated response from some system where a human has just placed an order. The order number is A123B567Z890X but if we want our system to "read back" the order number we apparently have to specially format the text. I suppose for the clarifying stuff "Alpha Bravo" that's a good idea, but separating digits and all those commas?

7. bryananderson ◴[15 Oct 24 18:39 UTC] No.41851695[source]▶

>>41842302 #

In this case it’s not tokenization. I wrote the text preprocessing code that deals with spacing these numbers. This is good feedback. It’s optimized for US-style 10-digit phone numbers, and it should be more flexible than that. For example, if I was reading a US phone number such as (123) 456-7890 over the phone and wanted to make sure it was heard correctly, I’d say “123”, “456”, “78”, “90”. But a 9-digit phone number should be spaced as you said.
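
The grouping rule described above can be sketched as follows. This is an illustration only, not the actual Play preprocessing code: the names `GROUPINGS`, `group_phone_digits`, and `speakable_phone` are hypothetical, and the only groupings included are the two mentioned in this thread (3-3-2-2 for 10 digits, 3-3-3 for 9 digits).

```python
import re

# Group sizes keyed by digit count. Only the two cases discussed in the
# thread are covered; other lengths fall back to a single group.
GROUPINGS = {
    10: (3, 3, 2, 2),  # US-style: "123", "456", "78", "90"
    9: (3, 3, 3),      # 9-digit: "123", "456", "789"
}


def group_phone_digits(number: str) -> list[str]:
    """Strip non-digits and split the rest into spoken groups."""
    digits = re.sub(r"\D", "", number)
    sizes = GROUPINGS.get(len(digits))
    if sizes is None:
        return [digits]  # unknown length: leave the digits ungrouped
    groups, pos = [], 0
    for size in sizes:
        groups.append(digits[pos:pos + size])
        pos += size
    return groups


def speakable_phone(number: str) -> str:
    """Space digits within a group and put commas between groups."""
    return ", ".join(" ".join(g) for g in group_phone_digits(number))


if __name__ == "__main__":
    print(group_phone_digits("(123) 456-7890"))
    print(speakable_phone("123-456-789"))
```

Keeping the group sizes in a lookup table makes it straightforward to add other national formats later without touching the splitting logic.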