The bitter lesson is coming for tokenization

(lucalp.dev)

296 points todsacerdoti | 2 comments | 24 Jun 25 14:14 UTC


Scene_Cast2 [24 Jun 25 15:24 UTC] No.44367229

I realized that with tokenization, there's a theoretical bottleneck when predicting the next token.

Let's say that we have 15k unique tokens (going by modern open models). Let's also say that we have an embedding dimensionality of 1k. This implies that we have at most 1k degrees of freedom (or rank) on our output. The model is able to pick any single one of the 15k tokens as the top token, but the expressivity of the _probability distribution_ is inherently limited to 1k unique linear components.
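A small sketch of the rank argument above, assuming the output logits are a plain linear map of the final hidden state (logits = W·h, with no bias). The sizes `VOCAB`, `DIM`, and `CONTEXTS` are scaled-down, hypothetical stand-ins for the 15k and 1k figures, and `matrix_rank` and `logit_matrix` are illustrative helpers, not library calls. However many contexts are stacked, the logit matrix has rank at most `DIM`; softmax then subtracts one normalizing constant per context, so the log-probabilities have rank at most `DIM + 1`.

```python
import random
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


def logit_matrix(hidden_states, unembedding):
    """logits[i][j] = dot(hidden_states[i], unembedding[j])."""
    return [
        [sum(h * w for h, w in zip(state, token_vec)) for token_vec in unembedding]
        for state in hidden_states
    ]


# Hypothetical toy sizes standing in for 15k tokens and 1k dimensions.
VOCAB, DIM, CONTEXTS = 30, 4, 50

rng = random.Random(0)
unembedding = [[rng.randint(-5, 5) for _ in range(DIM)] for _ in range(VOCAB)]
hidden_states = [[rng.randint(-5, 5) for _ in range(DIM)] for _ in range(CONTEXTS)]

logits = logit_matrix(hidden_states, unembedding)
logit_rank = matrix_rank(logits)
print(f"rank of the {CONTEXTS}x{VOCAB} logit matrix: {logit_rank} (bound: {DIM})")
```

Integer entries and `Fraction` arithmetic keep the rank computation exact, so the bound is not blurred by floating-point tolerance.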

replies(6): >>44367418 #>>44367537 #>>44367791 #>>44367849 #>>44368629 #>>44369055 #

blackbear_ [24 Jun 25 15:52 UTC] No.44367537

>>44367229 #

While the theoretical bottleneck is there, it is far less restrictive than what you are describing, because the number of almost orthogonal vectors grows exponentially with ambient dimensionality. And orthogonality is what matters to differentiate between different vectors: since any distribution can be expressed as a mixture of Gaussians, the number of separate concepts that you can encode with such a mixture also grows exponentially.
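A rough numerical illustration of the near-orthogonality claim, assuming vectors drawn uniformly at random from the unit sphere (real embeddings are learned, not random). The helper names and the sizes (256 vectors in 8, 32, and 128 dimensions) are hypothetical choices for the sketch. With more vectors than dimensions, exact orthogonality is impossible, yet the largest pairwise |cosine| shrinks as the dimension grows, since a typical cosine is on the order of 1/sqrt(dim).

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(num_vectors, dim, seed=0):
    """Largest |cosine similarity| over all pairs of random unit vectors."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(num_vectors)]
    worst = 0.0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            cos = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            worst = max(worst, abs(cos))
    return worst


# Hypothetical sizes: always more vectors than dimensions.
NUM_VECTORS = 256
results = {dim: max_abs_cosine(NUM_VECTORS, dim) for dim in (8, 32, 128)}
for dim, worst in results.items():
    print(f"dim={dim:4d}  max |cos| over {NUM_VECTORS} vectors: {worst:.3f}")
```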

replies(1): >>44368939 #

Scene_Cast2 [24 Jun 25 18:01 UTC] No.44368939

>>44367537 #

I agree that you can encode any single concept and that the encoding space of a single top pick grows exponentially.

However, I'm talking about the probability distribution of tokens.

replies(1): >>44372279 #

anonymoushn [24 Jun 25 23:49 UTC] No.44372279

>>44368939 #

I think within the framework of "almost-orthogonal axes" you can still create a vector that has the desired mix of projections onto any combination of these axes?

replies(1): >>44374144 #

yorwba [25 Jun 25 06:23 UTC] No.44374144

>>44372279 #

No. You can fit an exponential number of almost-orthogonal vectors into the input space, but the number of not-too-similar probability distributions over output tokens is also exponential in the output dimension. This is fine if you only care about a small subset of distributions (e.g. those that only assign significant probability to at most k tokens), but if you pick any random distribution, it's unlikely to be represented well. Fortunately, this doesn't seem to be much of an issue in practice and people even do top-k sampling intentionally.
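Top-k sampling, mentioned above, can be sketched as follows. This is a minimal illustration rather than any particular library's implementation; `top_k_filter` and `sample_top_k` are hypothetical names, and the logits are made-up numbers. It keeps the k highest logits, renormalizes over them, and gives every other token probability zero, which is exactly the kind of "at most k tokens" distribution described.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(logits, k):
    """Distribution that keeps the k highest logits and zeroes out the rest."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    kept_probs = softmax([logits[i] for i in keep])
    probs = [0.0] * len(logits)
    for index, p in zip(keep, kept_probs):
        probs[index] = p
    return probs


def sample_top_k(logits, k, rng):
    """Draw one token index from the top-k renormalized distribution."""
    probs = top_k_filter(logits, k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up logits for a four-token vocabulary.
example_logits = [2.0, 0.5, 1.0, -1.0]
print(top_k_filter(example_logits, 2))
print(sample_top_k(example_logits, 2, random.Random(0)))
```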

replies(1): >>44378990 #

anonymoushn [25 Jun 25 16:17 UTC] No.44378990

>>44374144 #

I see. You're right. I was either badly mistaken or only thinking about small k.
