
688 points crescit_eundo | 1 comment
1. kmeisthax
If tokenization is such a big problem, then why aren't we training new base models on partially de-tokenized data? For example, during training, randomly substitute some percentage of the input tokens with their individual characters.
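
A rough sketch of what that substitution could look like (hypothetical names and a toy vocabulary; a real setup would use the tokenizer's actual byte- or character-level vocabulary entries):

```python
import random

def char_dropout(token_ids, id_to_text, char_to_id, p=0.1, rng=None):
    """Randomly replace a fraction p of tokens with their individual
    characters, so the model also sees character-level spellings of
    strings it normally only sees as whole tokens.

    token_ids  : list[int]       -- one tokenized training example
    id_to_text : dict[int, str]  -- token id -> surface string
    char_to_id : dict[str, int]  -- single character -> token id
    """
    rng = rng or random.Random()
    out = []
    for tid in token_ids:
        text = id_to_text[tid]
        # Only split tokens whose characters all exist as single-character
        # tokens in the vocabulary; otherwise keep the token as-is.
        if rng.random() < p and all(c in char_to_id for c in text):
            out.extend(char_to_id[c] for c in text)
        else:
            out.append(tid)
    return out

# Toy vocabulary: whole-word tokens plus single-character fallbacks.
if __name__ == "__main__":
    vocab = ["hello", " world", "h", "e", "l", "o", " ", "w", "r", "d"]
    id_to_text = dict(enumerate(vocab))
    char_to_id = {t: i for i, t in id_to_text.items() if len(t) == 1}
    ids = [0, 1]  # "hello world" as two whole-word tokens
    print(char_dropout(ids, id_to_text, char_to_id, p=1.0))
```

Applied on the fly during data loading, this would expose the model to both the compressed token form and the spelled-out character form of the same text, without changing the tokenizer itself.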