Scene_Cast2:
I realized that with tokenization, there's a theoretical bottleneck when predicting the next token.

Let's say we have 15k unique tokens (going by modern open models) and an embedding dimensionality of 1k. This implies at most 1k degrees of freedom (i.e., rank) in the output. The model can still pick any single one of the 15k tokens as the top token, but the expressivity of the _probability distribution_ itself is inherently limited to 1k independent linear components.
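
A minimal numpy sketch of that bottleneck (assumed, scaled-down sizes standing in for the hypothetical 15k-token vocabulary and 1k-dimensional embedding, not any real model's weights): the logit matrix over many contexts always factors through the embedding dimension, so its rank is capped there.

    # Toy demonstration of the rank bottleneck (illustrative sizes only,
    # scaled down so the rank computation stays fast). Requires numpy.
    import numpy as np

    vocab_size, d_model, n_contexts = 1_500, 100, 400   # stand-ins for 15k, 1k, etc.
    rng = np.random.default_rng(0)

    unembedding = rng.standard_normal((vocab_size, d_model))   # output projection
    hidden = rng.standard_normal((n_contexts, d_model))        # one hidden state per context

    # Every row of logits is (unembedding @ hidden_i), so the whole logit matrix
    # factors through d_model dimensions: its rank can never exceed d_model.
    logits = hidden @ unembedding.T                             # (n_contexts, vocab_size)
    print(np.linalg.matrix_rank(logits))                        # prints 100, not 1500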

molf:
The key insight is that you can represent different features by vectors that aren't exactly perpendicular, just nearly perpendicular (for example, between 85 and 95 degrees apart). If you tolerate such noise, the number of vectors you can fit grows exponentially with the number of dimensions.

12288 dimensions (GPT-3's embedding size) can fit more than 40 billion nearly perpendicular vectors. [1]

[1]: https://www.3blue1brown.com/lessons/mlp#superposition
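
A rough numerical check of the "nearly perpendicular" claim (a sketch with assumed sizes and random rather than learned vectors): sample unit vectors in a 12288-dimensional space and confirm that every pairwise angle lands near 90 degrees.

    # Random unit vectors in a GPT-3-sized space are almost orthogonal.
    # Requires numpy; n is kept modest to limit memory.
    import numpy as np

    d, n = 12_288, 1_000
    rng = np.random.default_rng(0)

    vecs = rng.standard_normal((n, d))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)      # unit-normalize each row

    cos = vecs @ vecs.T                                       # pairwise cosine similarities
    np.fill_diagonal(cos, 0.0)                                # ignore self-similarity
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    # With d = 12288, the pairwise angles typically all fall in roughly the 85-95 degree band.
    print(angles.min(), angles.max())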

bravesoul2:
High dimensions are weird!