(lucalp.dev)

296 points todsacerdoti | 3 comments | 24 Jun 25 14:14 UTC


rryan ◴[25 Jun 25 05:38 UTC] No.44373939[source]▶

Don't make me tap the sign: There is no such thing as "bytes". There are only encodings. UTF-8 is the encoding most people mean when they talk about modeling the "raw bytes" of text. UTF-8 is just a shitty (biased) human-designed tokenizer of Unicode code points.
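
To make the "biased" part concrete, here is a minimal Python sketch. The helper names and the choice of sample characters are assumptions made for illustration, not anything from the article. It encodes the same four code points under UTF-8, UTF-16 and UTF-32 and counts the bytes each one costs.

```python
# Illustrative sketch only: helper names and sample characters are hypothetical.

def encoded_bytes(ch: str, encoding: str = "utf-8") -> list[int]:
    """Return the byte values that `encoding` assigns to the string `ch`."""
    return list(ch.encode(encoding))

# One code point per entry, written as escapes so the source file's own
# encoding cannot change what is being measured.
SAMPLES = {
    "Latin a": "\u0061",
    "Latin e-acute": "\u00e9",
    "CJK zhong": "\u4e2d",
    "emoji": "\U0001f600",
}

def bytes_per_code_point(encoding: str) -> dict[str, int]:
    """Map each sample's label to the number of bytes it needs in `encoding`."""
    return {name: len(encoded_bytes(ch, encoding)) for name, ch in SAMPLES.items()}

if __name__ == "__main__":
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, bytes_per_code_point(enc))
```

Under UTF-8 the four samples cost 1, 2, 3 and 4 bytes respectively, so a "byte-level" model sees three times as many steps for a CJK character as for an ASCII letter. Under UTF-32 every code point costs 4 bytes, which is the sense in which the byte sequence is a product of the encoding rather than a property of the text.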

replies(2): >>44377004 #>>44377091 #

1. hiddencost ◴[25 Jun 25 13:27 UTC] No.44377091[source]▶

>>44373939 #

Well akshually...

I assume you started programming some time this millennium? That's the only way I can explain this "take".

replies(2): >>44377568 #>>44385622 #

2. roflcopter69 ◴[25 Jun 25 14:11 UTC] No.44377568[source]▶

>>44377091 (TP) #

Care to elaborate?

3. vaxman ◴[26 Jun 25 09:21 UTC] No.44385622[source]▶

>>44377091 (TP) #

Roger, who spoke only Chinglish and never paused between words, was working on a VAX FORTRAN program that exchanged tapes with IBM mainframes and a memory-mapped section, inventing a new word in the process that still has me rolling decades later: ebsah-dicky-asky-codah


The bitter lesson is coming for tokenization