
296 points todsacerdoti | 1 comment
Scene_Cast2 No.44367229
I realized that with tokenization, there's a theoretical bottleneck when predicting the next token.

Let's say we have 15k unique tokens (going by modern open models) and an embedding dimensionality of 1k. This implies a maximum of 1k degrees of freedom (i.e., rank) in our output. The model can still pick any single one of the 15k tokens as the top token, but the expressivity of the _probability distribution_ itself is inherently limited to 1k linear components.
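
To make the rank limit concrete, here's a small NumPy sketch (my own illustration, not from the post, with scaled-down stand-ins for the 15k-token vocabulary and 1k-dimensional embedding so the rank check runs quickly): no matter how many contexts you feed through, the matrix of logit vectors never exceeds the embedding dimension in rank.

    import numpy as np

    # Scaled-down stand-ins for the numbers above: 15k tokens -> 1,500 and
    # 1k embedding dims -> 100, purely so the rank computation is fast.
    vocab_size, dim = 1_500, 100
    rng = np.random.default_rng(0)

    W = rng.standard_normal((vocab_size, dim))    # unembedding / output projection
    contexts = rng.standard_normal((dim, 5_000))  # hidden states from 5,000 different contexts

    logits = W @ contexts                         # (vocab_size, 5_000): one logit vector per context
    print(np.linalg.matrix_rank(logits))          # 100 -- bounded by dim, never by vocab_size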

replies(6): >>44367418 #>>44367537 #>>44367791 #>>44367849 #>>44368629 #>>44369055 #
1. unoti No.44367418
I imagine there's actually combinatorial power in there, though. Even with an embedding of only 2 dimensions, x and y, we can encode an essentially unlimited number of concepts, because distinct clusters or neighborhoods can be spread out over a large 2D map. It's of course far more possible with more dimensions.
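
A tiny NumPy sketch of that intuition (my own, with made-up numbers): spread 10,000 "concept" centroids over a 2-D plane and recover the right one by nearest-neighbor lookup. The top choice stays unambiguous even in 2 dimensions; the rank limit above constrains the shape of the full distribution, not which single token can come out on top.

    import numpy as np

    # Hypothetical numbers: 10,000 "concepts", each a centroid in a 2-D embedding space.
    rng = np.random.default_rng(0)
    n_concepts = 10_000
    centroids = rng.uniform(-100, 100, size=(n_concepts, 2))

    # A query point sitting very close to concept #1234's neighborhood.
    query = centroids[1234] + rng.normal(scale=0.01, size=2)

    nearest = int(np.argmin(np.linalg.norm(centroids - query, axis=1)))
    print(nearest)  # 1234 (with overwhelming probability): 2 dims suffice to tell 10k concepts apart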