688 points crescit_eundo | 11 comments
azeirah ◴[] No.42141993[source]
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least try to falsify the hypothesis that many huge issues are due to tokenization problems... but yeah.

Surprised I don't see more research into radically different tokenization.

replies(14): >>42142033 #>>42142384 #>>42143197 #>>42143338 #>>42143381 #>>42144059 #>>42144207 #>>42144582 #>>42144600 #>>42145725 #>>42146419 #>>42146444 #>>42149355 #>>42151016 #
cschep ◴[] No.42142033[source]
How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?
replies(2): >>42142126 #>>42142146 #
1. skylerwiernik ◴[] No.42142146[source]
Couldn't we just make every human readable character a token?

OpenAI's tokenizer turns "chess" into "ch" and "ess". We could just make it "c" "h" "e" "s" "s".
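
A quick way to check how a current BPE tokenizer actually splits this is sketched below; it assumes the tiktoken package and the cl100k_base encoding (the exact pieces depend on which encoding you load):

    # Sketch: inspect how a BPE encoder splits a word vs. a character-level scheme.
    # Assumes tiktoken is installed; cl100k_base is just one choice of encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("chess")
    pieces = [enc.decode([i]) for i in ids]
    print(pieces)            # the subword pieces the encoder chose

    # A character-level scheme would instead emit one token per character:
    print(list("chess"))     # ['c', 'h', 'e', 's', 's']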

replies(3): >>42142199 #>>42142203 #>>42142835 #
2. taeric ◴[] No.42142199[source]
This is just more tokens? And it probably requires the model to learn the common groups itself. Consider: "ess" makes sense to see as a group; "wss" does not.

That is, the groups are encoding something the model doesn't have to learn.

This is not far from the "sight words" we teach kids.

replies(2): >>42143246 #>>42145899 #
3. tchalla ◴[] No.42142203[source]
aka Character Language Models which have existed for a while now.
4. cco ◴[] No.42142835[source]
We can, tokenization is literally just to maximize resources and provide as much "space" as possible in the context window.

There is no advantage to tokenization; it just helps work around limitations in context windows and training.

replies(1): >>42143249 #
5. TZubiri ◴[] No.42143246[source]
> This is just more tokens?

Yup. Just let the actual ML git gud

replies(1): >>42143334 #
6. TZubiri ◴[] No.42143249[source]
I like this explanation
7. taeric ◴[] No.42143334{3}[source]
So, put differently, this is just more expensive?
replies(1): >>42152806 #
8. Hendrikto ◴[] No.42145899[source]
No, actually a much smaller vocabulary: 256 tokens cover all bytes. See the ByT5 paper: https://arxiv.org/abs/2105.13626
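
A rough way to see the tradeoff (small vocabulary, longer sequences) in numbers; tiktoken and cl100k_base are assumptions here, and ByT5's real vocabulary is the 256 byte values plus a few special tokens:

    # Sketch: byte-level vs. BPE sequence length for the same text.
    import tiktoken

    text = "Tokenization trades a small vocabulary for longer sequences."

    byte_tokens = list(text.encode("utf-8"))      # vocab size 256, one token per byte
    bpe = tiktoken.get_encoding("cl100k_base")    # vocab size ~100k
    bpe_tokens = bpe.encode(text)

    print(len(byte_tokens), len(bpe_tokens))      # byte sequence is several times longer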
replies(1): >>42145925 #
9. taeric ◴[] No.42145925{3}[source]
More tokens per sequence, though. And since it is learning sequences...
replies(1): >>42147484 #
10. loa_in_ ◴[] No.42147484{4}[source]
Yeah, suddenly 16k tokens is just 16 KB of ASCII instead of ~6k words
11. TZubiri ◴[] No.42152806{4}[source]
Expensive computationally, expensive in time, and yes, expensive in cost.

Worth noting that cost probably grows quadratically, cubically, or with some other polynomial in sequence length, so the jump in computational difficulty from going to one character per token is probably far larger than the raw character-to-token ratio suggests.
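
A back-of-the-envelope version of that, assuming self-attention's quadratic cost in sequence length and a rough ~4 characters per BPE token (both numbers are assumptions, not measurements):

    # Sketch: if attention cost grows like n^2, going to one token per
    # character multiplies n by the chars-per-token ratio and the attention
    # cost by that ratio squared.
    chars_per_token = 4        # rough English BPE average (assumption)
    n_tokens = 16_000          # subword-token context length

    n_chars = n_tokens * chars_per_token
    cost_ratio = (n_chars / n_tokens) ** 2

    print(n_chars)             # 64000 character-level positions for the same text
    print(cost_ratio)          # 16.0 -- attention alone gets ~16x more expensive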