←back to thread

688 points crescit_eundo | 2 comments | | HN request time: 0.428s | source
Show context
azeirah ◴[] No.42141993[source]
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it percieves the world by limiting the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.

Surprised I don't see more research into radicaly different tokenization.

replies(14): >>42142033 #>>42142384 #>>42143197 #>>42143338 #>>42143381 #>>42144059 #>>42144207 #>>42144582 #>>42144600 #>>42145725 #>>42146419 #>>42146444 #>>42149355 #>>42151016 #
1. ATMLOTTOBEER ◴[] No.42144059[source]
I tend to agree with you. Your post reminded me of https://gwern.net/aunn
replies(1): >>42177854 #
2. gwern ◴[] No.42177854[source]
One neat thing about the AUNN idea is that when you operate at the function level, you get sort of a neural net version of lazy evaluation; in this case, because you train at arbitrary indices in arbitrary datasets you define, you can do whatever you want with tokenization (as long as you keep it consistent and don't retrain the same index with different values). You can format your data in any way you want, as many times as you want, because you don't have to train on 'the whole thing', any more than you have to evaluate a whole data structure in Haskell; you can just pull the first _n_ elements of an infinite list, and that's fine.

So there is a natural way to not just use a minimal bit or byte level tokenization, but every tokenization simultaneously: simply define your dataset to be a bunch of datapoints which are 'start-of-data token, then the byte encoding of a datapoint followed by the BPE encoding of that followed by the WordPiece encoding followed by ... until the end-of-data token'.

You need not actually store any of this on disk, you can compute it on the fly. So you can start by training only on the byte encoded parts, and then gradually switch to training only on the BPE indices, and then gradually switch to the WordPiece, and so on over the course of training. At no point do you need to change the tokenization or tokenizer (as far as the AUNN knows) and you can always switch back and forth or introduce new vocabularies on the fly, or whatever you want. (This means you can do many crazy things if you want. You could turn all documents into screenshots or PDFs, and feed in image tokens once in a while. Or why not video narrations? All it does is take up virtual indices, you don't have to ever train on them...)