
425 points karimf | 1 comment
trollbridge No.45655616
An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.
replies(5): >>45655692 >>45655754 >>45655792 >>45655815 >>45656008
1. mohsen1 No.45655815
It costs more to train on audio tokens, but I'm sure we will get there. Training a model on the transcript of a YouTube lecture vs. training on the audio of it will make a difference.
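To make the cost difference concrete, here is a toy sketch (not any specific model's actual pipeline) of one classic way to turn raw speech into discrete tokens: 8-bit mu-law quantization, as used in WaveNet-style sample-level models. The sample rate, amplitude, and quantization level count below are illustrative assumptions; modern systems use learned neural codecs that compress far more aggressively, but the length gap versus text remains large.

```python
import numpy as np

def mulaw_tokenize(audio: np.ndarray, levels: int = 256) -> np.ndarray:
    """Map float audio in [-1, 1] to integer tokens in [0, levels)."""
    mu = levels - 1
    # Mu-law companding: compress dynamic range before uniform quantization.
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_detokenize(tokens: np.ndarray, levels: int = 256) -> np.ndarray:
    """Inverse mapping: tokens back to approximate float audio."""
    mu = levels - 1
    compressed = 2 * tokens.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

# One second of a 440 Hz tone at 16 kHz becomes 16000 tokens, while its
# transcript ("a one-second beep") would be a handful of text tokens --
# one reason training directly on audio is so much more expensive.
t = np.linspace(0, 1, 16000, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440 * t)
tokens = mulaw_tokenize(audio)
```

The sequence-length blowup is the core issue the comment points at: even after heavy codec compression, an hour of speech yields orders of magnitude more tokens than its transcript, so every training step touches far more positions.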