688 points crescit_eundo | 17 comments
azeirah ◴[] No.42141993[source]
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.

Surprised I don't see more research into radically different tokenization.
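The difference between what a byte-level model and a subword-level model "sees" is easy to show in a toy sketch (plain Python; the subword vocabulary below is made up, not any real tokenizer's):

```python
text = "strawberry"

# Byte-level input: the model receives every raw byte; nothing is hidden.
byte_view = list(text.encode("utf-8"))
print(byte_view)  # [115, 116, 114, 97, 119, 98, 101, 114, 114, 121]

# Subword input: a made-up merge table collapses spans into opaque IDs.
toy_vocab = {"straw": 1001, "berry": 1002}
token_view = [toy_vocab["straw"], toy_vocab["berry"]]
print(token_view)  # [1001, 1002] -- the individual letters are gone
```

In the second view the model never sees the letters at all, which is the kind of structural limitation being described.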

replies(14): >>42142033 #>>42142384 #>>42143197 #>>42143338 #>>42143381 #>>42144059 #>>42144207 #>>42144582 #>>42144600 #>>42145725 #>>42146419 #>>42146444 #>>42149355 #>>42151016 #
1. numpad0 ◴[] No.42144207[source]
hot take: LLM tokens are kanji for AI, and just like kanji they work okay sometimes but fail miserably at accurately representing English
replies(2): >>42148388 #>>42150181 #
2. umanwizard ◴[] No.42148388[source]
Why couldn’t Chinese characters accurately represent English? Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).

If England had been in the Chinese sphere of influence rather than the Roman one, English would presumably be written with Chinese characters too. The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.

replies(2): >>42150302 #>>42150512 #
3. int_19h ◴[] No.42150181[source]
You could absolutely write a tokenizer that would consistently tokenize all distinct English words as distinct tokens, with a 1:1 mapping.

But AFAIK there's no evidence that this actually improves anything, and if you spend that much of the dictionary on one language, it comes at the cost of making the encoding for everything else much less efficient.
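A whole-word tokenizer of the kind described is only a few lines; the vocabulary here is a tiny stand-in (a real one would need hundreds of thousands of entries, which is exactly the dictionary budget at issue):

```python
# Toy whole-word tokenizer: every known English word maps 1:1 to one ID.
vocab = {"the": 0, "cat": 1, "sat": 2}  # stand-in for a full English lexicon
unk = len(vocab)                        # single fallback ID for unknown words

def tokenize(sentence: str) -> list[int]:
    return [vocab.get(w, unk) for w in sentence.lower().split()]

print(tokenize("The cat sat"))  # [0, 1, 2]
```

Every ID spent on a whole English word is an ID unavailable to other languages, code, and rare strings, which is the efficiency cost being pointed out.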

replies(1): >>42155255 #
4. stickfigure ◴[] No.42150302[source]
If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.

But the residents of England do in fact speak English, and English is a phonetic language, so there's an inherent impedance mismatch between Chinese characters and English language. I can make up words in English and write them down which don't necessarily have Chinese written equivalents (and probably, vice-versa?).

replies(1): >>42150670 #
5. skissane ◴[] No.42150512[source]
> Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).

The problem is – in writing Japanese with kanji, lots of somewhat arbitrary decisions had to be made. Which kanji to use for which native Japanese word? There isn't always an obviously best choice from first principles. But that's not a problem in practice, because a tradition developed of which kanji to use for which Japanese word (kun'yomi readings). For English, however, we don't have such a tradition. So it isn't clear which Chinese character to use for each English word. If two people tried to write English with Chinese characters independently, they'd likely make different character choices, and the mutual intelligibility might be poor.

Also, while neither Japanese nor Korean belongs to the same language family as Chinese, both borrowed lots of words from Chinese. In Japanese, a lot of use of kanji (especially on'yomi reading) is for borrowings from Chinese. Since English borrowed far fewer terms from Chinese, this other method of "deciding which character(s) to use" – look at the word's Chinese etymology – largely doesn't work for English given very few English words have Chinese etymology.

Finally, they also invented kanji in Japan for certain Japanese words – kokuji. The same thing happened for Korean Hanja (gukja), to a lesser degree. Vietnamese Chữ Nôm contains thousands of invented-in-Vietnam characters. Probably, if English had adopted Chinese writing, the same would have happened. But again, deciding when to do it, and if so how, is a somewhat arbitrary choice that can't be settled outside of a real societal tradition of doing it.

> The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.

Using the Latin alphabet changed English, just as using Chinese characters changed Japanese, Korean and Vietnamese. If English had used Chinese characters instead of the Latin alphabet, it would be a very different language today. Possibly not in grammar, but certainly in vocabulary.

6. umanwizard ◴[] No.42150670{3}[source]
> If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.

That’s not what I mean at all. I mean even if spoken English were exactly the same as it is now, it could have been written with Chinese characters, and indeed would have been if England had been in the Chinese sphere of cultural influence when literacy developed there.

> English is a phonetic language

What does it mean to be a “phonetic language”? In what sense is English “more phonetic” than the Chinese languages?

> I can make up words in English and write them down which don’t necessarily have Chinese written equivalents

Of course. But if English were written with Chinese characters people would eventually agree on characters to write those words with, just like they did with all the native Japanese words that didn’t have Chinese equivalents but are nevertheless written with kanji.

Here is a famous article about how a Chinese-like writing system would work for English: https://www.zompist.com/yingzi/yingzi.htm

replies(2): >>42154464 #>>42155104 #
7. stickfigure ◴[] No.42154464{4}[source]
> In what sense is English “more phonetic” than the Chinese languages?

Written English vs written Chinese.

How would you write, in Chinese, the words thingamajibber, gizmosity, or half the things that come out of AvE's mouth? These words have subtle, humorous, and entertaining meanings by way of twisting the sounds of other existing words. Shakespeare was a master of this kind of wordplay and invented a surprising number of words we use today.

I'm not saying you can't have the same phenomenon in spoken Chinese. But how do you write it down without a phonetic alphabet? And if you can't write it down, how do you share it to a wide audience?

replies(1): >>42154867 #
8. umanwizard ◴[] No.42154867{5}[source]
> How would you write, in Chinese, the words thingamajibber, gizmosity, or half the things that come out of AvE's mouth?

With Chinese characters, of course. Why wouldn’t you be able to?

In English “thing”, “a”, and “ma” are already words, and “jibber” would presumably be the first character in “gibberish”. So you could write that made-up word by combining those four characters.

> But how do you write it down without a phonetic alphabet?

In general to write a newly coined word you would repurpose characters that sound the same as the newly coined word.

Every syllable that can possibly be uttered according to Mandarin phonology is represented by some character (usually many), so this is always possible.

---

Regardless, to reiterate the original point: I'm not claiming Chinese characters are better or more flexible than alphabetic writing. They're not. I'm simply claiming that there's no inherent property of Japanese that makes it more amenable to representation with Chinese characters than English is (other than the fact that a lot of its vocabulary comes from Chinese, but that's not a real counterpoint given that there is lots of native, non-Chinese-derived vocabulary that's still written with kanji).

It would be possible to write Japanese entirely in the Latin alphabet, or English entirely with some system similar to Chinese characters, with minimal to no change to the structure of the language.

replies(2): >>42155215 #>>42162019 #
9. numpad0 ◴[] No.42155104{4}[source]
"Donald Trump" in CJK, taken from each Wikipedia page URL and as I hear it - each is close enough[1] and natural enough in its respective language, but none of them is particularly useful for counting R in strawberry:

  C: 唐納·川普, "Thangnar Changpooh"  
  J: ドナルド・トランプ, "Donaludo Toranpu"  
  K: 도널드 트럼프, "D'neldeh Tlempeuh"  
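The strawberry point comes down to this: counting letters is trivial over characters and impossible over opaque token IDs (the IDs below are made up for illustration, not a real tokenizer's):

```python
word = "strawberry"
print(word.count("r"))   # 3 -- easy when the model sees characters

tokens = [3504, 19772]   # hypothetical IDs for "straw" + "berry"
# Nothing in these integers encodes the three r's; a model working on
# token IDs would have to memorize letter counts per token instead.
```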
> What does it mean to be a “phonetic language”?

It means the script is intended to record pronunciation rather than meaning: e.g. it's easy to see how "cow" is intended to be pronounced, but not necessarily clear what a cow is. An ideographic script, on the other hand, focuses on meaning: e.g. "魚" is supposed to look like a fish, but its pronunciation varies among "yueh", "sakana", "awe", etc.

1: I tried looking up other notable figures, but thought this person, with an entertainment background, illustrates the point more clearly

10. numpad0 ◴[] No.42155215{6}[source]
> I'm simply claiming that there's no inherent property of Japanese that makes it more amenable to representation with Chinese characters than English is

what? No. Nothing but IPA (and that only technically) and a language's own native writing system works for recording its pronunciation. Hiragana, Hangul, and Chữ Quốc Ngữ would not exist otherwise.

e: would _not_ exist

replies(1): >>42155340 #
11. numpad0 ◴[] No.42155255[source]
I mean, it just felt to me that current LLMs must architecturally favor a fixed-length "ideome", like a phoneme but for meaning, having been conceived under the influence of research in CJK.

And with the architecture based on such idea-like elements, I just casually thought there could be limits as to how much it can be pushed into perfecting English, and that some radical change, not simply dropping tokenization but something more fundamental, has to take place at some point.

replies(1): >>42159182 #
12. umanwizard ◴[] No.42155340{7}[source]
Then why are both English and Latin represented with Latin characters despite having completely different phoneme inventories?
replies(1): >>42156141 #
13. numpad0 ◴[] No.42156141{8}[source]
Because one is a distant ancestor of the other...? It never adopted a writing system from outside. The written and spoken systems co-evolved from a clean slate.
replies(1): >>42158827 #
14. umanwizard ◴[] No.42158827{9}[source]
That’s not true. English is not a descendant of Latin, and the Latin alphabet was adopted from the outside, replacing Anglo-Saxon runes (also called the Futhorc script).

Just like kanji are not native to Japanese.

15. int_19h ◴[] No.42159182{3}[source]
I don't think it's hard for an LLM to treat a sequence of two tokens as a semantically meaningful unit, though. It has to handle much more complicated dependencies to parse higher-level syntactic structures of the language.
16. stickfigure ◴[] No.42162019{6}[source]
> In English “thing”, “a”, and “ma” are already words, and “jibber” would presumably be the first character in “gibberish”. So you could write that made-up word by combining those four characters.

Nonsense. There is zero chance in hell that if you combine the pictographs for "thing", "a", "ma", and "gibberish", someone reading that is going to reproduce the sound thingamajibber. It just does not work. The meme does not replicate.

There may be other virtues of pictographic written language, but reproducing sounds is not one of them. And - as any Shakespeare fan will tell you - tweaking the sounds of English cleverly is rather important. If you can't reproduce this behavior, you're losing something in translation. So to speak.

replies(1): >>42162107 #
17. umanwizard ◴[] No.42162107{7}[source]
Chinese characters aren't pictographs, so whether English could be written with pictographs is irrelevant to this discussion.

Each Chinese character represents a syllable (in Chinese languages) or a small set of possible sequences of syllables (in Japanese).

And yes, in Chinese languages, new words are created from characters that sound like the parts of the new word, all the time.