> From a domain point of view, some are skeptical that bytes are adequate for modelling natural language
If I remember correctly, GPT-3.5's tokenizer treated Cyrillic as individual characters, and GPT-3.5 was pretty good at Russian.
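For a rough sense of what finer-grained units mean for Russian text, here is a minimal sketch using only Python's built-in UTF-8 encoding. It does not reproduce GPT-3.5's tokenizer; the helper name `utf8_units` and the sample strings are made up for illustration. It only shows that basic Cyrillic letters take two bytes each in UTF-8, so a byte-level model would see about twice as many units as a character-level one.

```python
# Illustrative only: plain UTF-8 encoding, not GPT-3.5's actual tokenizer.
def utf8_units(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


english = "hello"
russian = "привет"

print(utf8_units(english))  # (5, 5): ASCII letters are one byte each
print(utf8_units(russian))  # (6, 12): basic Cyrillic letters are two bytes each
```

So character-level Cyrillic is already a shorter sequence than raw bytes would be, which is worth keeping in mind when comparing the two setups.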