
688 points crescit_eundo | 1 comments
azeirah ◴[] No.42141993[source]
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.

Surprised I don't see more research into radically different tokenization.
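
To make the contrast in this comment concrete, here is a minimal sketch comparing a token-level view of a word with a byte-level view. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are hypothetical, invented only for illustration; they are not any real model's tokenizer, and real subword vocabularies split text differently.

```python
# Toy illustration only: TOY_VOCAB is a made-up vocabulary,
# not the tokenizer of any real model.
TOY_VOCAB = {"str": 0, "aw": 1, "berry": 2}


def toy_tokenize(text, vocab):
    """Greedy longest-match segmentation of text into IDs from vocab."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


word = "strawberry"

# Token view: a short list of opaque IDs; the spelling is not directly visible.
token_view = toy_tokenize(word, TOY_VOCAB)

# Byte view: one integer per UTF-8 byte; every letter is individually visible.
byte_view = list(word.encode("utf-8"))

# Counting a letter is a direct operation on the byte view.
r_count = byte_view.count(ord("r"))

print(token_view)
print(byte_view)
print(r_count)
```

In the token view, the letters inside each piece are hidden behind a single ID, while the byte view is longer but exposes every character. That is the trade-off the comment is pointing at: byte-level input costs more sequence length but does not bake a segmentation into what the model sees.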

replies(14): >>42142033 #>>42142384 #>>42143197 #>>42143338 #>>42143381 #>>42144059 #>>42144207 #>>42144582 #>>42144600 #>>42145725 #>>42146419 #>>42146444 #>>42149355 #>>42151016 #
numpad0 ◴[] No.42144207[source]
hot take: LLM tokens are kanji for AI, and just like kanji they work okay sometimes but fail miserably at the task of accurately representing English
replies(2): >>42148388 #>>42150181 #
umanwizard ◴[] No.42148388[source]
Why couldn’t Chinese characters accurately represent English? Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).

If England had been in the Chinese sphere of influence rather than the Roman one, English would presumably be written with Chinese characters too. The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.

replies(2): >>42150302 #>>42150512 #
stickfigure ◴[] No.42150302[source]
If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.

But the residents of England do in fact speak English, and English is a phonetic language, so there's an inherent impedance mismatch between Chinese characters and the English language. I can make up words in English and write them down which don't necessarily have Chinese written equivalents (and probably vice versa?).

replies(1): >>42150670 #
umanwizard ◴[] No.42150670[source]
> If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.

That’s not what I mean at all. I mean even if spoken English were exactly the same as it is now, it could have been written with Chinese characters, and indeed would have been if England had been in the Chinese sphere of cultural influence when literacy developed there.

> English is a phonetic language

What does it mean to be a “phonetic language”? In what sense is English “more phonetic” than the Chinese languages?

> I can make up words in English and write them down which don’t necessarily have Chinese written equivalents

Of course. But if English were written with Chinese characters people would eventually agree on characters to write those words with, just like they did with all the native Japanese words that didn’t have Chinese equivalents but are nevertheless written with kanji.

Here is a famous article about how a Chinese-like writing system would work for English: https://www.zompist.com/yingzi/yingzi.htm

replies(2): >>42154464 #>>42155104 #
1. numpad0 ◴[] No.42155104[source]
"Donald Trump" in CJK, taken from the Wikipedia page URLs, with romanizations as I hear them. Each is close enough[1] and natural enough in its respective language, but none of them is particularly useful for counting the R's in "strawberry":

  C: 唐納·川普, "Thangnar Changpooh"
  J: ドナルド・トランプ, "Donaludo Toranpu"
  K: 도널드 트럼프, "D'neldeh Tlempeuh"

> What does it mean to be a “phonetic language”?

It means the script is intended to record pronunciation rather than meaning: e.g. it's easy to see how "cow" is meant to be pronounced, but it's not necessarily clear what a cow is. An ideographic script, on the other hand, focuses on meaning: e.g. "魚" is supposed to look like a fish, but its pronunciation varies ("yueh", "sakana", "awe", etc.).

1: I tried looking up other notable figures, but thought this person, having an entertainment background, illustrates the point more clearly.