
228 points | nkko | 1 comment
amelius No.43891549
Would it be possible for the LLM to do the tokenization implicitly? So instead of building a separate tokenizer, you just allow any string of characters as input, then have a neural network that converts it into tokens, where the weights of that network are trained along with the rest of the LLM.
replies(1): >>43891839 #
kmeisthax No.43891839
We already do this. Neural networks can't work with tokens directly - they only take real-valued vectors, and their input has to be differentiable[0]. So you don't feed the model token 123, 456, etc.; you turn each token into a "one-hot encoded" vector that's all zeroes except in the position indexed by the token ID, which gets set to one.
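A toy sketch of the one-hot step (the vocabulary size and token IDs here are made up for illustration; real vocabularies are far larger):

```python
VOCAB_SIZE = 8  # toy size; real models use vocabularies of 10k-100k

def one_hot(token_id: int, vocab_size: int = VOCAB_SIZE) -> list[float]:
    """Return a vector that is all zeros except a 1.0 at position token_id."""
    vec = [0.0] * vocab_size
    vec[token_id] = 1.0
    return vec

print(one_hot(3))  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```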

These one-hot encoded vectors are then fed through a linear layer that projects them down to the hidden state size of the model, e.g. you might have a token vocabulary of 10-100k but a hidden state size of 0.5-2k. Everything else in the model works in hidden state space[1], which has all sorts of higher-level concepts in it.
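Multiplying a one-hot vector by the linear layer's weight matrix just selects one row of that matrix, which is why frameworks implement this step as a table lookup (e.g. an embedding layer). A minimal sketch with made-up toy sizes:

```python
import random

VOCAB_SIZE, HIDDEN = 8, 4  # toy sizes; real models: ~10k-100k vocab, ~0.5k-2k hidden

random.seed(0)
# Weight matrix of the linear layer: one row of HIDDEN weights per vocab entry.
W = [[random.random() for _ in range(HIDDEN)] for _ in range(VOCAB_SIZE)]

def one_hot(token_id: int) -> list[float]:
    vec = [0.0] * VOCAB_SIZE
    vec[token_id] = 1.0
    return vec

def matvec(vec: list[float], mat: list[list[float]]) -> list[float]:
    # (1 x VOCAB) times (VOCAB x HIDDEN) -> (1 x HIDDEN)
    return [sum(vec[i] * mat[i][j] for i in range(VOCAB_SIZE))
            for j in range(HIDDEN)]

tid = 5
# The one-hot multiply picks out row `tid` of W exactly.
assert matvec(one_hot(tid), W) == W[tid]
```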

Now, if we were to remove tokenization, the encoder would need to do more work to get to the same hidden state space we're used to. It might be able to find a more efficient encoding from raw bytes to the hidden space, but that seems unlikely, given that the tokenization most models use is already based on the statistical properties of the training set. If we don't merge "anti" or "ism" into a single token before handing the text to the model, then the attention heads in the model's lower layers have to do that same work.

Given that we used to train models on character sequences, and then moved to tokenization because it was more efficient, I suspect the trade-off is never going to be worth it.

[0] That is, you can't just give it a list of token IDs, because there's no mathematical meaning to token 123.25, nor any meaning to increasing or decreasing token IDs.

[1] This improves performance but makes interpretability harder. Most notably, the hidden space's basis vectors are not directly correlated to words or concepts; instead all the concepts exist on a sort of N-dimensional ring.

replies(1): >>43893360 #
amelius No.43893360
> If we don't automatically pair "anti" or "ism" into a single token before handing it off to the model, then the attention heads on the lower layers in the model have to do the same work.

What I mean is an extra neural network that comes before the input of the LLM, which converts characters (or simple one-hot vectors corresponding to characters) into tokens (or whatever you'd call the internal representation of the network). The advantage would be a more unified way of representing the LLM, and I guess one downside would be that you'd get a lot of replication in the NN, but perhaps these parts could be merged (have shared weights).
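A rough sketch of what such a learned character encoder might look like. Everything here is hypothetical: the fixed grouping window, the toy sizes, and the mean-pooling stand in for whatever the network would actually learn; in the proposal these weights would be trained end-to-end with the rest of the LLM rather than built as a separate tokenizer.

```python
import random

random.seed(0)
CHARS = 64   # toy character vocabulary (e.g. a subset of byte values)
HIDDEN = 4   # toy hidden size
WINDOW = 4   # hypothetical fixed grouping; a real encoder would learn boundaries

# Character embedding table; trained jointly with the LLM in this proposal.
char_emb = [[random.random() for _ in range(HIDDEN)] for _ in range(CHARS)]

def encode(char_ids: list[int]) -> list[list[float]]:
    """Map a raw character sequence to token-like hidden vectors:
    embed each character, then mean-pool fixed windows of embeddings."""
    vecs = [char_emb[c] for c in char_ids]
    out = []
    for i in range(0, len(vecs), WINDOW):
        chunk = vecs[i:i + WINDOW]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out

pseudo_tokens = encode([7, 3, 12, 9, 1, 5])
# 6 characters with WINDOW=4 -> 2 pseudo-token vectors of size HIDDEN
assert len(pseudo_tokens) == 2
```

The rest of the LLM would then consume these pseudo-token vectors exactly as it consumes embedded tokens today, which is the "unified representation" the comment describes.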