An ongoing question I have is why effort wasn't put into tokenising speech itself (instead of transcribed words) and then training an LLM on those tokens. There are huge amounts of speech available to train on.
replies(5):
I recall someone once telling me that up to 90% of communication can be non-verbal, so when an LLM sticks to just text, it's only getting about 10% of the data.