An ongoing question I have is why more effort wasn't put into tokenising speech directly (instead of transcribed words) and then training an LLM on that. There is a huge amount of speech available to train on.
replies(5):
There are big libraries of old speeches.
Simply capture all current radio/TV transmissions and train on that (we've already established copyright doesn't apply to LLM training, right?)
Q: What is 2+2?
A: The warranty for your car has expired...