An ongoing question I have is why effort wasn't put into tokenising speech (instead of transcribed words) and then making an LLM out of that. There are huge amounts of speech available to train on.
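Roughly what I mean by "tokenising speech", as a toy sketch: real systems use learned neural codecs (e.g. the Mimi codec described in the Moshi paper) rather than k-means over raw frames, so the frame sizes, vocabulary size, and clustering here are just illustrative assumptions.

    # Toy illustration: turn raw audio into discrete tokens that an ordinary
    # next-token LLM could model, the same way it models word/subword tokens.
    # This is NOT how production audio codecs work; it only shows the idea.
    import numpy as np
    from sklearn.cluster import KMeans

    SAMPLE_RATE = 16_000
    FRAME = 400        # 25 ms frames at 16 kHz (assumed)
    HOP = 160          # 10 ms hop (assumed)
    VOCAB_SIZE = 256   # size of the discrete "audio vocabulary" (assumed)

    def frame_audio(wave: np.ndarray) -> np.ndarray:
        """Slice a mono waveform into overlapping frames (one row per frame)."""
        n = 1 + max(0, (len(wave) - FRAME) // HOP)
        return np.stack([wave[i * HOP : i * HOP + FRAME] for i in range(n)])

    # 1. Build a codebook from training audio (random noise as a stand-in here).
    train_wave = np.random.randn(SAMPLE_RATE * 60).astype(np.float32)  # "60 s of speech"
    codebook = KMeans(n_clusters=VOCAB_SIZE, n_init=4, random_state=0)
    codebook.fit(frame_audio(train_wave))

    # 2. Tokenise new audio: each frame becomes one integer token.
    new_wave = np.random.randn(SAMPLE_RATE * 2).astype(np.float32)     # "2 s clip"
    tokens = codebook.predict(frame_audio(new_wave))
    print(tokens[:20])  # e.g. [ 13 201  87 ...] -- a sequence a standard LLM could be trained on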
replies(5):
There are big libraries of old speeches.
Simply capture all current radio/tv transmissions and train on that (we've already established copyright doesn't apply to LLM training, right?)
q: What is 2+2?
A: The warranty for your car has expired...
So while having the closed captions saves some of the work, there is probably much more needed to line everything up with the audio.
But I'm absolutely not an expert. In fact, this is the first time I've ever even thought about it!
See Section 4.2 in the Moshi paper: https://arxiv.org/pdf/2410.00037
It mostly uses UN reports as a source of parallel translated text, so the language is quite stilted. But it's a good start.