
296 points todsacerdoti | 2 comments
1. kgeist No.44372056
>From a domain point of view, some are skeptical that bytes are adequate for modelling natural language

If I remember correctly, GPT-3.5's tokenizer treated Cyrillic as individual characters, yet GPT-3.5 was still pretty good at Russian.

replies(1): >>44373669 #
2. dgfitz No.44373669
I wonder if they treat each letter as a Unicode code point, with each code point being its own token. I could see the same being true of other languages.
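The distinction being discussed here can be sketched quickly. This is not OpenAI's actual tokenizer, just an illustration of why "one token per character" (code point) and "bytes as tokens" are different granularities for Cyrillic: each Cyrillic letter occupies two bytes in UTF-8, so a pure byte-level model sees twice as many symbols per word as a per-character one.

```python
# Sketch: per-code-point vs per-byte views of Cyrillic text.
# (Illustrative only; not the tokenizer GPT-3.5 actually used.)
text = "привет"  # "hello" in Russian

# One symbol per Unicode code point, as speculated above:
char_tokens = [ord(c) for c in text]
print(len(char_tokens))   # 6 code points

# Byte-level view: every Cyrillic letter is 2 bytes in UTF-8,
# so a byte tokenizer sees a sequence twice as long.
byte_tokens = list(text.encode("utf-8"))
print(len(byte_tokens))   # 12 bytes
```

So even if Cyrillic ends up as "individual characters" in a BPE vocabulary, that is still coarser than raw bytes, which is part of why the byte-vs-token modelling question in the parent article is not settled by this example alone.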