
688 points crescit_eundo | 3 comments
azeirah No.42141993
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing altogether? We're literally limiting what a model can see and how it perceives the world by constraining the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify the hypothesis that many huge issues are due to tokenization problems, but... yeah.

Surprised I don't see more research into radically different tokenization.
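
A minimal sketch of the contrast being described, assuming the tiktoken package is available; it compares the subword view a tokenized model gets with the raw UTF-8 byte view a tokenizer-free model would get:

    import tiktoken

    text = "Tokenization limits what the model perceives."

    enc = tiktoken.get_encoding("cl100k_base")        # a common BPE vocabulary
    token_ids = enc.encode(text)
    token_strs = [enc.decode([t]) for t in token_ids]

    byte_ids = list(text.encode("utf-8"))             # the "no tokenizer" view: one ID per byte

    print(len(token_ids), token_strs)                 # a handful of subword chunks
    print(len(byte_ids))                              # a longer, but lossless, byte sequence

The byte sequence is lossless but several times longer than the token sequence, which is the speed cost mentioned above.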

replies(14): >>42142033 #>>42142384 #>>42143197 #>>42143338 #>>42143381 #>>42144059 #>>42144207 #>>42144582 #>>42144600 #>>42145725 #>>42146419 #>>42146444 #>>42149355 #>>42151016 #
numpad0 No.42144207
hot take: LLM tokens are kanji for AI, and just like kanji it works okay sometimes but fails miserably at the task of accurately representing English
replies(2): >>42148388 #>>42150181 #
1. int_19h No.42150181
You could absolutely write a tokenizer that would consistently tokenize all distinct English words as distinct tokens, with a 1:1 mapping.

But AFAIK there's no evidence that this actually improves anything, and if you spend that much of the dictionary on one language, it comes at the cost of making the encoding for everything else much less efficient.
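
A toy sketch of the 1:1 word-to-token scheme described above (the class and names are made up for illustration); it shows both the exact mapping and the out-of-vocabulary fallback that eats into the dictionary budget:

    class WordTokenizer:
        """Hypothetical word-level tokenizer: every known word is exactly one token."""

        def __init__(self, corpus_words):
            self.unk_id = 0
            self.vocab = {w: i + 1 for i, w in enumerate(sorted(set(corpus_words)))}
            self.inverse = {i: w for w, i in self.vocab.items()}

        def encode(self, text):
            # 1:1 word-to-token mapping; anything unseen collapses to a single <unk> ID
            return [self.vocab.get(w, self.unk_id) for w in text.lower().split()]

        def decode(self, ids):
            return " ".join(self.inverse.get(i, "<unk>") for i in ids)

    tok = WordTokenizer(["the", "cat", "sat", "on", "mat"])
    print(tok.encode("the cat sat on the mat"))      # each word maps to exactly one ID
    print(tok.encode("the cat sat on the tatami"))   # unseen word -> <unk>

Covering English this way would take hundreds of thousands of vocabulary slots, which is the dictionary cost the comment points to.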

replies(1): >>42155255 #
2. numpad0 No.42155255
I mean, it just feels to me that current LLMs must architecturally favor a fixed-length "ideome", like a phoneme but for meaning, having been conceived under the influence of research on CJK languages.

And being architecturally based on such idea-like elements, I just casually thought there could be limits to how far it can be pushed toward perfecting English, and that some radical change - not simply dropping tokenization but something more fundamental - has to take place at some point.

replies(1): >>42159182 #
3. int_19h No.42159182
I don't think it's hard for an LLM to treat a sequence of two tokens as a semantically meaningful unit, though. It has to handle much more complicated dependencies to parse the higher-level syntactic structures of the language.
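
For concreteness, a small check (again assuming tiktoken) of the kind of multi-token word being discussed; rarer English words routinely span two or more tokens that the model has to recombine into one unit:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["cat", "tokenization", "antidisestablishmentarianism"]:
        ids = enc.encode(word)
        print(word, "->", [enc.decode([i]) for i in ids])
    # Common words tend to be a single token; rarer ones split into several,
    # yet the model still treats each word as one semantic unit.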