rryan No.44373939
Don't make me tap the sign: there is no such thing as "bytes". There are only encodings. UTF-8 is the encoding most people mean when they talk about modeling the "raw bytes" of text, and UTF-8 is just a shitty (biased) human-designed tokenizer of the Unicode code points.
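To make that concrete, here is a quick Python sketch (my own illustration, not from the comment): the "bytes" of a piece of text only exist relative to an encoding, and UTF-8 is one particular mapping from code points to byte sequences.

```python
# The "raw bytes" of text only exist relative to an encoding.
s = "héllo"                          # 5 Unicode code points
print([hex(ord(c)) for c in s])      # the code points themselves
print(list(s.encode("utf-8")))       # 6 bytes: 'é' becomes 0xC3 0xA9
print(list(s.encode("utf-16-le")))   # a different encoding, different bytes
```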
replies(2): >>44377004 #>>44377091 #
cschmidt No.44377004
Virtually all current tokenization schemes do work at the raw byte level, not on UTF-8 code points. They do this to avoid the out-of-vocabulary (OOV), or unknown token, problem. In older models, if you came across something in the data you couldn't tokenize, you substituted an <UNK> token. But tokenization should be exactly reversible, so now people use subword tokenizers whose vocabulary includes all 256 single bytes. That way you can always represent any text by dropping down to the single-byte level. The other alternative would be to add every Unicode code point to the vocabulary, but there are more than 150k of those, and enough of them are rare that many would be undertrained. You'd end up with a lot of glitch tokens (https://arxiv.org/abs/2405.05417). It does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.
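For what it's worth, here's a minimal sketch of byte-level fallback (the vocab and merge table below are made up, just to show the mechanism): the vocabulary contains all 256 single-byte tokens plus learned multi-byte merges, so any UTF-8 input can be encoded with no <UNK> and decoding is exactly reversible.

```python
# Hypothetical learned subword merges; ids 0-255 are reserved for single bytes.
LEARNED = {b"the": 300, b"to": 301}

def encode(text: str) -> list[int]:
    data = text.encode("utf-8")
    tokens, i = [], 0
    while i < len(data):
        # greedily try the longest learned merge starting at this position...
        for j in range(len(data), i, -1):
            if data[i:j] in LEARNED:
                tokens.append(LEARNED[data[i:j]])
                i = j
                break
        else:
            # ...otherwise fall back to the single-byte token (ids 0-255)
            tokens.append(data[i])
            i += 1
    return tokens

def decode(tokens: list[int]) -> str:
    inv = {v: k for k, v in LEARNED.items()}
    data = b"".join(inv.get(t, bytes([t])) for t in tokens)
    # a model sampling these tokens freely may emit byte sequences that
    # aren't valid UTF-8, hence the replacement-character fallback
    return data.decode("utf-8", errors="replace")

print(decode(encode("the naïve café")) == "the naïve café")  # True: reversible
```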
replies(1): >>44377052 #
cschmidt No.44377052
And with regard to UTF-8 being a shitty, biased tokenizer, here is a recent paper trying to design a better style of encoding: https://arxiv.org/abs/2505.24689