An ongoing question I have is why more effort wasn't put into tokenising speech directly (instead of transcribed words) and then building an LLM on top of that. There are huge amounts of speech available to train on.
There is speech data, but nowhere near the amount of written language, which is fairly normalized and doesn't require accounting for additional features such as language, dialect, intonation, facial expressions, and hand gestures. Speech-to-text serves as the translation layer: it throws most of those features away and condenses the signal into a set of tokens that are much more efficient to model and to map between languages.
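To put rough numbers on that efficiency gap, here's a back-of-envelope sketch. The figures are assumptions on my part: a neural audio codec in the EnCodec style emitting ~75 frames/sec with 8 residual codebooks, versus conversational English at ~150 words/min transcribed into BPE tokens at ~1.3 tokens per word.

```python
# Rough comparison of discrete token rates: raw speech tokens (via a
# neural audio codec) vs. text tokens from a transcript. All constants
# below are assumed typical values, not measurements.

AUDIO_FRAMES_PER_SEC = 75   # assumed codec frame rate (EnCodec-style, 24 kHz)
CODEBOOKS = 8               # assumed residual vector-quantizer depth
WORDS_PER_MIN = 150         # typical conversational speaking rate
BPE_TOKENS_PER_WORD = 1.3   # rough average for English BPE vocabularies

audio_tokens_per_sec = AUDIO_FRAMES_PER_SEC * CODEBOOKS
text_tokens_per_sec = WORDS_PER_MIN / 60 * BPE_TOKENS_PER_WORD

print(f"audio tokens/sec: {audio_tokens_per_sec}")             # 600
print(f"text tokens/sec:  {text_tokens_per_sec:.2f}")          # 3.25
print(f"ratio: ~{audio_tokens_per_sec / text_tokens_per_sec:.0f}x")  # ~185x
```

Under these assumptions a second of speech costs on the order of a hundred times more tokens than its transcript, before you even get to the extra variability (speaker, accent, prosody) the model would have to absorb. That's the sense in which speech-to-text acts as a lossy but very effective compression layer.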