
425 points karimf | 1 comment
trollbridge No.45655616
An ongoing question I have is why more effort hasn't been put into tokenising speech directly (instead of transcribed words) and then training an LLM on that. There are huge amounts of speech available to train on.
replies(5): >>45655692 #>>45655754 #>>45655792 #>>45655815 #>>45656008 #
benob No.45655754
Audio tokenization consumes at least 4x as many tokens as text. So there is an efficiency problem to start with. Then: is there enough audio data to train an LLM from scratch?
replies(3): >>45655785 #>>45656849 #>>45663245 #
cyberax No.45663245
Yup. You can use Mozilla's corpus: https://commonvoice.mozilla.org/en

It mostly uses UN reports as a source of parallel translated text, so the language is rather stilted. But it's a good start.