(kyutai.org)

425 points karimf | 2 comments | 21 Oct 25 12:55 UTC

trollbridge [21 Oct 25 13:34 UTC] No.45655616

An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.

replies(5): >>45655692 #>>45655754 #>>45655792 #>>45655815 #>>45656008 #

1. MichealCodes [21 Oct 25 13:40 UTC] No.45655692

>>45655616 #

I don't think we've had the transformer moment for audio training yet, but yes, in theory audio-first models will be much more capable.

replies(1): >>45655709 #

2. trollbridge [21 Oct 25 13:42 UTC] No.45655709

>>45655692 (TP) #

Particularly interesting would be transformations between tokenised audio and tokenised text.

I recall someone once telling me that up to 90% of communication can be non-verbal, so an LLM that sticks to text alone is only getting about 10% of the data.

Neural audio codecs: how to get audio into LLMs