237 points JnBrymn | 3 comments
sabareesh ◴[] No.45675879[source]
It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.
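
Rough back-of-the-envelope comparison of the two representations (not from any paper; the tokenizer choice, page size, patch size and pooling factor are all assumptions):

    # How many text tokens a passage costs vs. how many vision tokens the same
    # text might cost if rendered as a page and split into patches.
    import tiktoken  # assumes the tiktoken package is installed

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-style BPE vocabulary
    passage = "The quick brown fox jumps over the lazy dog. " * 100

    text_tokens = len(enc.encode(passage))

    # Suppose the passage fits on one rendered page fed to a ViT-style encoder
    # at 1024x1024 with 16x16 patches (both numbers are assumptions).
    patches = (1024 // 16) * (1024 // 16)   # 4096 patches before any pooling
    pooled = patches // 16                  # many vision encoders pool/merge patches

    print(f"text tokens:   {text_tokens}")
    print(f"vision tokens: {pooled} (after assumed 16x pooling)")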
replies(3): >>45675953 #>>45676049 #>>45677115 #
1. CuriouslyC ◴[] No.45676049[source]
Image models use "larger" tokens. You can get this effect with text tokens if you use a larger token dictionary and generate common n-gram tokens, but the current LLM architecture isn't friendly to large output distributions.
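
Toy sketch of what an n-gram-augmented dictionary could look like (the corpus, n, and vocab budget are placeholders, not a recipe anyone ships):

    # Count frequent token n-grams in a corpus and promote the most common ones
    # to single entries in a bigger dictionary.
    from collections import Counter

    def build_ngram_vocab(token_ids: list[int], n: int = 3, extra_slots: int = 1000):
        """Return the most common n-grams as candidate new dictionary entries."""
        counts = Counter(
            tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)
        )
        return [ngram for ngram, _ in counts.most_common(extra_slots)]

    # Usage: ids from any base tokenizer; each frequent trigram becomes one
    # "larger" token, so common phrases cost one step instead of three -- at the
    # price of a much bigger output softmax, which is the unfriendly part.
    base_ids = [5, 7, 9, 5, 7, 9, 2, 5, 7, 9]
    print(build_ngram_vocab(base_ids, n=3, extra_slots=5))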
replies(2): >>45677876 #>>45677936 #
2. mark_l_watson ◴[] No.45677876[source]
Interesting idea! Haven’t heard that before.
3. yorwba ◴[] No.45677936[source]
You don't have to use the same token dictionary for input and output. There are already things like predicting multiple tokens ahead simultaneously, as an auxiliary loss and for speculative decoding, where the output is effectively larger than the input; similarly, you could have a model whose input tokens combine multiple output tokens. You would still need a forward pass per output token during autoregressive generation, but prefill would require fewer passes and the KV cache would be smaller, so it could still produce a decent speedup.
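
Very rough sketch of the coarse-input idea (PyTorch; the module and all sizes are made up for illustration, not an existing architecture):

    # k consecutive output tokens are merged into one input token by
    # concatenating their embeddings and projecting back down.
    import torch
    import torch.nn as nn

    class MergedInputEmbedding(nn.Module):
        def __init__(self, vocab_size: int, d_model: int, k: int):
            super().__init__()
            self.k = k
            self.tok = nn.Embedding(vocab_size, d_model)
            self.merge = nn.Linear(k * d_model, d_model)  # k fine tokens -> 1 coarse token

        def forward(self, output_ids: torch.Tensor) -> torch.Tensor:
            # output_ids: (batch, seq_len), seq_len divisible by k
            b, t = output_ids.shape
            e = self.tok(output_ids)                      # (b, t, d)
            e = e.view(b, t // self.k, self.k * e.size(-1))
            return self.merge(e)                          # (b, t/k, d): fewer positions,
                                                          # so fewer prefill passes and a
                                                          # smaller KV cache

    emb = MergedInputEmbedding(vocab_size=32000, d_model=512, k=4)
    ids = torch.randint(0, 32000, (1, 128))
    print(emb(ids).shape)  # torch.Size([1, 32, 512])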

But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff, and more fine-grained KV-cache compression methods might deliver better speedups without degrading the output as much.
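
Back-of-the-envelope KV-cache arithmetic for the memory side (the model shape and the 10x compression factor are assumptions, not DeepSeek-OCR's actual configuration):

    def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
        # K and V per layer, per head, per position
        return seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes * 2

    full = kv_cache_bytes(seq_len=100_000)
    compressed = kv_cache_bytes(seq_len=100_000 // 10)  # assume 10x input compression

    print(f"uncompressed:   {full / 1e9:.1f} GB")        # ~13.1 GB
    print(f"10x compressed: {compressed / 1e9:.1f} GB")  # ~1.3 GB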