
696 points crescit_eundo | 1 comment
1. kmeisthax No.42143243
If tokenization is such a big problem, then why aren't we training new base models on partially de-tokenized data? E.g., during training, randomly substitute some percentage of the input tokens with their individual letters.
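
The augmentation the comment proposes could be sketched roughly as follows. This is a hypothetical illustration, not an established training recipe: it operates on token strings rather than real tokenizer IDs, and the function name and parameters are made up for the example.

```python
import random

def char_dropout(tokens, p=0.1, rng=None):
    """With probability p, replace a multi-character token with its
    individual characters, so the model sometimes sees the same text
    as single-character tokens during training."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if len(tok) > 1 and rng.random() < p:
            out.extend(tok)  # split the token into single characters
        else:
            out.append(tok)  # keep the original token
    return out

# Toy example: token strings stand in for real tokenizer output.
tokens = ["The", " straw", "berry", " has", " three", " r", "s"]
print(char_dropout(tokens, p=0.5, rng=random.Random(0)))
```

Note the substitution preserves the underlying text exactly (the concatenation of the output equals the concatenation of the input), so it changes only the segmentation the model is exposed to, not the training data itself.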