425 points karimf | 9 comments

trollbridge ◴[] No.45655616[source]
An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.
replies(5): >>45655692 #>>45655754 #>>45655792 #>>45655815 #>>45656008 #
1. benob ◴[] No.45655754[source]
Audio tokenization consumes at least 4x tokens versus text. So there is an efficiency problem to start with. Then is there enough audio data to train a LLM from scratch?
replies(3): >>45655785 #>>45656849 #>>45663245 #
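(A rough back-of-envelope on that 4x figure, using assumed numbers: conversational speech at ~150 words/min, ~1.3 BPE tokens per word of English text, and a 12.5 Hz neural audio codec in the spirit of Moshi's Mimi. A minimal sketch:)

    # All numbers here are rough assumptions, not measurements.
    speech_words_per_sec = 150 / 60      # ~150 words/min of conversational speech
    text_tokens_per_word = 1.3           # typical BPE average for English text
    text_tokens_per_sec = speech_words_per_sec * text_tokens_per_word  # ~3.3

    audio_frames_per_sec = 12.5          # assumed codec frame rate (Mimi-like)
    codebooks_per_frame = 8              # residual codebooks modelled per frame

    print(f"text : ~{text_tokens_per_sec:.1f} tokens/s")
    print(f"audio: {audio_frames_per_sec} frames/s, "
          f"{audio_frames_per_sec * codebooks_per_frame:.0f} codes/s with all codebooks")
    print(f"frames-only ratio: ~{audio_frames_per_sec / text_tokens_per_sec:.1f}x")

Even counting only frames (ignoring the extra codebook tokens per frame), audio needs roughly 4x the tokens of the equivalent text.
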
2. trollbridge ◴[] No.45655785[source]
Start an MVNO that offers cheaper phone plans and train on all those phone calls.

There are big libraries of old speeches.

Simply capture all current radio/TV transmissions and train on that (we've already established copyright doesn't apply to LLM training, right?)

replies(1): >>45656245 #
3. miki123211 ◴[] No.45656245[source]
> Start an MVNO that offers cheaper phone plans and train on all those phone calls.

q: What is 2+2?

A: The warranty for your car has expired...

4. 542354234235 ◴[] No.45656849[source]
Don't we have tens of thousands of hours (hundreds of thousands?) of closed-captioned TV shows and movies? How many hours of news broadcasts with transcripts do we have? Maybe I just don't understand what is needed, but it seems like we have a lot of data to work with.
replies(2): >>45656942 #>>45656992 #
5. roboror ◴[] No.45656942[source]
Sure, but that needs to be licensed.
6. cruffle_duffle ◴[] No.45656992[source]
Correct me if I’m wrong, but you need more than just closed captions. You need precise timing too. I’d think the text needs to line up exactly with the audio, so that when the voice makes an “A” sound, the text it aligns with is “A” as well.

So while having the closed captions saves some of the work, there is probably much more needed to get everything lined up.

But I’m absolutely not an expert at all. In fact, this is the first time I’ve ever even thought about it!

replies(1): >>45657447 #
7. vvolhejn ◴[] No.45657447{3}[source]
Author here. Speech-to-text is more or less solved; it's easy to automatically get captions, including precise timestamps. For training Moshi, Kyutai's audio LLM, my colleagues used whisper-timestamped to transcribe 7 million hours of audio.

See Section 4.2 in the Moshi paper: https://arxiv.org/pdf/2410.00037

replies(1): >>45658167 #
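(For reference, a minimal sketch of what word-level timestamping with whisper-timestamped looks like; the file name and model size here are placeholders, and this is not the Moshi training pipeline itself.)

    import json
    import whisper_timestamped as whisper

    # Load audio and a Whisper model; larger models give better words and timings.
    audio = whisper.load_audio("example.wav")
    model = whisper.load_model("small", device="cpu")

    # The result contains segments whose words carry "start"/"end" timestamps.
    result = whisper.transcribe(model, audio, language="en")
    print(json.dumps(result["segments"][0]["words"], indent=2, ensure_ascii=False))
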
8. cruffle_duffle ◴[] No.45658167{4}[source]
Sweet!
9. cyberax ◴[] No.45663245[source]
Yup. You can use Mozilla's corpus: https://commonvoice.mozilla.org/en

It mostly uses the UN reports as a source of parallel translated texts, so the language is quite a bit stilted. But it's a good start.
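(If anyone wants to poke at it, Common Voice is also mirrored on the Hugging Face Hub; a minimal sketch, assuming the mozilla-foundation/common_voice_13_0 dataset name and that you've accepted its terms and logged in:)

    from datasets import load_dataset

    # Stream the English split so nothing huge is downloaded up front.
    # Dataset name/version is an assumption; gated datasets need `huggingface-cli login`.
    cv = load_dataset("mozilla-foundation/common_voice_13_0", "en",
                      split="train", streaming=True)

    for clip in cv.take(3):
        # Each record pairs the decoded audio with its validated transcript.
        secs = len(clip["audio"]["array"]) / clip["audio"]["sampling_rate"]
        print(f"{clip['sentence']!r} ({secs:.1f}s)")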