688 points by crescit_eundo | 1 comment
azeirah (No.42141993)
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing altogether? We're literally limiting what a model can see and how it perceives the world by constraining the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least try to falsify the hypothesis that many huge issues might be due to tokenization problems... but yeah.
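
As a rough illustration (a minimal sketch in plain Python; the sentence and the tiny subword vocabulary are made up), the main cost of dropping tokenization is sequence length: the same text becomes several times more symbols at the byte level than as subwords, and transformer attention cost grows roughly quadratically with sequence length.

    # Toy comparison: byte-level input vs. greedy subword tokenization (hypothetical vocab).
    text = "tokenization limits what the model sees"

    # Byte-level "tokens": one symbol per UTF-8 byte.
    byte_tokens = list(text.encode("utf-8"))

    # Tiny made-up subword vocabulary, matched greedily longest-first.
    vocab = ["token", "ization", " limits", " what", " the", " model", " sees"]

    def greedy_tokenize(s, vocab):
        tokens, i = [], 0
        while i < len(s):
            match = next((v for v in sorted(vocab, key=len, reverse=True)
                          if s.startswith(v, i)), s[i])  # fall back to one character
            tokens.append(match)
            i += len(match)
        return tokens

    subword_tokens = greedy_tokenize(text, vocab)
    print(len(byte_tokens), len(subword_tokens))  # 39 byte tokens vs. 7 subword tokens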

Surprised I don't see more research into radically different tokenization.

numpad0 (No.42144207)
hot take: LLM tokens are kanji for AI, and just like kanji they work okay sometimes but fail miserably at accurately representing English
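
For a concrete sense of how subword vocabularies carve up English, something like the snippet below works (assuming the tiktoken package is installed; the exact splits depend on the vocabulary used): the pieces a BPE vocabulary produces often cut across morpheme boundaries rather than following them.

    # Inspect how a real BPE vocabulary splits English words (pip install tiktoken).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["unhappiness", "tokenization", "strawberry"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(word, "->", pieces)  # pieces need not align with morphemes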
umanwizard (No.42148388)
Why couldn’t Chinese characters accurately represent English? Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).

If England had been in the Chinese sphere of influence rather than the Roman one, English would presumably be written with Chinese characters too. The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.

skissane (No.42150512)
> Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).

The problem is – in writing Japanese with kanji, lots of somewhat arbitrary decisions had to be made. Which kanji to use for which native Japanese word? There isn't always an obviously best choice from first principles. But that's not a problem in practice, because a tradition developed of which kanji to use for which Japanese word (kun'yomi readings). For English, however, we don't have such a tradition, so it isn't clear which Chinese character to use for each English word. If two people independently tried to write English with Chinese characters, they'd likely make different character choices, and mutual intelligibility might be poor.

Also, while neither Japanese nor Korean belongs to the same language family as Chinese, both borrowed lots of words from Chinese. In Japanese, a lot of kanji usage (especially on'yomi readings) is for borrowings from Chinese. Since English borrowed far fewer terms from Chinese, this other method of deciding which character(s) to use – looking at the word's Chinese etymology – largely doesn't work for English, given that very few English words have a Chinese etymology.

Finally, kanji were also invented in Japan for certain Japanese words – kokuji. The same thing happened with Korean Hanja (gukja), to a lesser degree, and Vietnamese Chữ Nôm contains thousands of invented-in-Vietnam characters. If English had adopted Chinese writing, the same would probably have happened. But again, deciding when and how to do that is a somewhat arbitrary choice, one that's impossible outside a real societal tradition of doing it.

> The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.

Using the Latin alphabet changed English, just as using Chinese characters changed Japanese, Korean and Vietnamese. If English had used Chinese characters instead of the Latin alphabet, it would be a very different language today. Possibly not in grammar, but certainly in vocabulary.