
296 points todsacerdoti | 2 comments
Scene_Cast2 (No.44367229)
I realized that with tokenization, there's a theoretical bottleneck when predicting the next token.

Let's say that we have 15k unique tokens (going by modern open models). Let's also say that we have an embedding dimensionality of 1k. This implies a maximum of 1k degrees of freedom (or rank) in the output: every logit vector is a linear function of a 1k-dimensional hidden state. The model can still pick any one of the 15k tokens as the top token, but the expressivity of the _probability distribution_ is inherently limited to 1k unique linear components.
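
(To make the rank argument concrete, here is a minimal sketch using the sizes above, a 15k vocabulary and a 1k embedding dimension; the name W_unembed and the random weights are purely illustrative, not taken from any particular model.)

```python
import numpy as np

vocab_size, d_model = 15_000, 1_000
rng = np.random.default_rng(0)

# Fixed output projection ("unembedding") and one final hidden state.
W_unembed = rng.standard_normal((vocab_size, d_model))
h = rng.standard_normal(d_model)

# Logits over the whole vocabulary, then softmax.
logits = W_unembed @ h                   # shape (15_000,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Whatever h the network produces, the logits always lie in the column
# space of W_unembed, so the reachable logit vectors span at most d_model
# dimensions inside the 15k-dimensional space of all logit vectors.
print(np.linalg.matrix_rank(W_unembed))  # at most 1_000
```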

1. kevingadd (No.44367791)
It seems like you're assuming that models are trying to predict the next token. Is that really how they work? I would have assumed that tokenization is an input-only step, so you have perhaps up to 50k unique input tokens available, but the output is raw text, synthesized speech, or an image. The output is not tokens, so there are no limitations on the output.
2. anonymoushn (No.44367970)
yes, in typical architectures for models dealing with text, the output is a token from the same vocabulary as the input.
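
(A minimal sketch of what "same vocabulary in and out" means for a typical decoder-only text model: the final hidden state is projected onto the same vocabulary used for the input ids, and the prediction is sampled from that distribution. The transformer stack is replaced here by a trivial mean over input embeddings, and tying the output projection to the input embedding is just one common choice, not a requirement.)

```python
import numpy as np

vocab_size, d_model = 15_000, 1_000
rng = np.random.default_rng(0)

embedding = rng.standard_normal((vocab_size, d_model))  # input lookup table
W_out = embedding                   # weight tying: output projection reuses it

input_ids = np.array([17, 4242, 9001])   # token ids from the shared vocabulary
h = embedding[input_ids].mean(axis=0)    # stand-in for the transformer stack

logits = W_out @ h                        # one logit per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_id = rng.choice(vocab_size, p=probs)  # the "output" is again a token id
```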