425 points | karimf | 1 comment
amelius | No.45655505
> Many LLMs have voice interfaces, but they usually work by transcribing your speech, generating the answer in text, and using text-to-speech to read the response out loud. That’s perfectly fine in many cases (...), but it’s a wrapper, not real speech understanding.

But I can say the same about tokenization. LLMs first convert groups of characters into tokens, generate new tokens from those, and then convert the tokens back into characters. That's not real understanding either! If LLMs are so smart, we should be able to skip the tokenization step.
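The pipeline being described (characters to tokens, generation over tokens, tokens back to characters) can be sketched with a toy greedy tokenizer. The vocabulary and the round-trip here are invented for illustration and are much simpler than the BPE-style tokenizers real LLMs use:

```python
# Toy illustration of the tokenize -> generate -> detokenize pipeline.
# The vocabulary is made up; real tokenizers learn ~50k+ subword pieces.
vocab = {"hel": 0, "lo": 1, " wor": 2, "ld": 3}
inv_vocab = {i: s for s, i in vocab.items()}

def encode(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return ids

def decode(ids):
    """Map token ids back to text."""
    return "".join(inv_vocab[i] for i in ids)

ids = encode("hello world")   # [0, 1, 2, 3]
assert decode(ids) == "hello world"  # round-trip is lossless
```

The model itself only ever sees the integer ids, which is the sense in which tokenization is a "wrapper" around the characters, analogous to text-to-speech wrapping the audio.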

replies(2): >>45655705 >>45665961
1. vvolhejn | No.45665961
There's a great blog post by Sander Dieleman about exactly this: why do we need a two-step pipeline, in particular for images and audio? https://sander.ai/2025/04/15/latents.html

For text, there are a few papers that train the tokenizer and the language model end-to-end; see for example: https://arxiv.org/abs/2305.07185
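One way to "skip" a learned tokenizer, used by byte-level models such as the linked paper, is to treat each raw UTF-8 byte as a token, giving a fixed 256-symbol vocabulary with no training step. A minimal sketch of that encoding:

```python
# Byte-level alternative: no learned vocabulary, just raw UTF-8 bytes.
text = "héllo"
ids = list(text.encode("utf-8"))  # each byte is one "token" in range 0..255

# "é" encodes to two bytes, so the sequence is longer than the character count.
assert len(ids) == 6

# The mapping is trivially invertible, so nothing is lost.
assert bytes(ids).decode("utf-8") == text
```

The trade-off is longer sequences (roughly 3-4x for English text versus subword tokens), which is why such models typically add a hierarchical or multiscale architecture on top.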