Most active commenters
  • umanwizard(6)
  • numpad0(5)
  • aithrowawaycomm(4)
  • TZubiri(4)
  • taeric(3)
  • ipsum2(3)
  • PittleyDunkin(3)
  • int_19h(3)
  • stickfigure(3)

688 points crescit_eundo | 69 comments
1. azeirah ◴[] No.42141993[source]
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least try to falsify the hypothesis that many of these huge issues are due to tokenization problems... but yeah.

Surprised I don't see more research into radically different tokenization.

replies(14): >>42142033 #>>42142384 #>>42143197 #>>42143338 #>>42143381 #>>42144059 #>>42144207 #>>42144582 #>>42144600 #>>42145725 #>>42146419 #>>42146444 #>>42149355 #>>42151016 #
2. cschep ◴[] No.42142033[source]
How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?
replies(2): >>42142126 #>>42142146 #
3. viraptor ◴[] No.42142126[source]
That's not what tokenized means here. Parent is asking to provide the model with separate characters rather than tokens, i.e. groups of characters.
4. skylerwiernik ◴[] No.42142146[source]
Couldn't we just make every human readable character a token?

OpenAI's tokenizer splits "chess" into "ch" and "ess". We could just make it "c" "h" "e" "s" "s".
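
A rough sketch of the difference (this assumes the `tiktoken` package, and the exact split depends on the vocabulary, so "chess" may or may not actually come out as "ch" + "ess"):

  # Rough sketch; assumes the `tiktoken` package, and the split depends on the vocabulary.
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")   # one of OpenAI's BPE vocabularies
  ids = enc.encode("chess")
  print([enc.decode([i]) for i in ids])        # subword pieces, e.g. ['ch', 'ess']

  # The character-level (or byte-level) alternative: one token per character or byte.
  print(list("chess"))                         # ['c', 'h', 'e', 's', 's']
  print(list("chess".encode("utf-8")))         # [99, 104, 101, 115, 115]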

replies(3): >>42142199 #>>42142203 #>>42142835 #
5. taeric ◴[] No.42142199{3}[source]
This is just more tokens? And probably requires the model to learn about common groups. Consider, "ess" makes sense to see as a group. "Wss" does not.

That is, the groups are encoding something the model doesn't have to learn.

This is not far removed from the "sight words" we teach kids.

replies(2): >>42143246 #>>42145899 #
6. tchalla ◴[] No.42142203{3}[source]
aka Character Language Models which have existed for a while now.
7. aithrowawaycomm ◴[] No.42142384[source]
FWIW I think most of the "tokenization problems" are in fact reasoning problems being falsely blamed on a minor technical thing when the issue is much more profound.

E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be accurately solved if they go step-by-step. Without this assistance the LLM is likely to simply guess.

replies(6): >>42142733 #>>42142807 #>>42143239 #>>42143800 #>>42144596 #>>42146428 #
8. ipsum2 ◴[] No.42142733[source]
The more obvious alternative is that CoT is making up for the deficiencies in tokenization, which I believe is the case.
replies(1): >>42142913 #
9. Der_Einzige ◴[] No.42142807[source]
I'm the one who will fight you, including with peer-reviewed papers indicating that it is in fact due to tokenization. I'm too tired right now, but I will edit this later, so take this as my bookmark to remind me to respond.
replies(4): >>42142884 #>>42144506 #>>42145678 #>>42147347 #
10. cco ◴[] No.42142835{3}[source]
We can; tokenization is literally just there to maximize resources and provide as much "space" as possible in the context window.

There is no advantage to tokenization beyond that; it just helps work around limitations in context windows and training.

replies(1): >>42143249 #
11. aithrowawaycomm ◴[] No.42142884{3}[source]
I am aware of errors in computations that can be fixed by better tokenization (e.g. long addition works better when digits are tokenized right-to-left rather than left-to-right). But I am talking about counting, and counting words, not characters. I don't think tokenization explains why LLMs tend to fail at this without CoT prompting. I really think the answer is computational complexity: counting is simply too hard for transformers unless you use CoT. https://arxiv.org/abs/2310.07923
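
(For concreteness, what right-to-left digit tokenization buys you, as a toy sketch:)

  # Toy illustration: grouping digits from the right keeps "tokens" aligned with place value.
  n = "1234567"
  left_to_right = [n[i:i + 3] for i in range(0, len(n), 3)]                  # ['123', '456', '7']
  right_to_left = [n[max(0, i - 3):i] for i in range(len(n), 0, -3)][::-1]   # ['1', '234', '567']
  print(left_to_right, right_to_left)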
replies(1): >>42143144 #
12. aithrowawaycomm ◴[] No.42142913{3}[source]
I think the more obvious explanation has to do with computational complexity: counting is an O(n) problem, but transformer LLMs can’t solve O(n) problems unless you use CoT prompting: https://arxiv.org/abs/2310.07923
replies(2): >>42143402 #>>42150368 #
13. cma ◴[] No.42143144{4}[source]
Words vs. characters is a similar problem, since a token can be less than one word, multiple words, multiple words plus a partial word, or words with non-word punctuation like a sentence-ending period.
14. jncfhnb ◴[] No.42143197[source]
There’s a reason human brains have dedicated language handling. Tokenization is likely a solid strategy. The real thing here is that language is not a good way to encode all forms of knowledge
replies(1): >>42144149 #
15. TZubiri ◴[] No.42143239[source]
> FWIW I think most of the "tokenization problems"

List of actual tokenization limitations:
1. strawberry
2. rhyming and metrics
3. whitespace (as displayed in the article)

16. TZubiri ◴[] No.42143246{4}[source]
> This is just more tokens?

Yup. Just let the actual ML git gud

replies(1): >>42143334 #
17. TZubiri ◴[] No.42143249{4}[source]
I like this explanation
18. taeric ◴[] No.42143334{5}[source]
So, put differently, this is just more expensive?
replies(1): >>42152806 #
19. layer8 ◴[] No.42143338[source]
Going from tokens to bytes explodes the model size. I can’t find the reference at the moment, but reducing the average token size induces a corresponding quadratic increase in the width (size of each layer) of the model. This doesn’t just affect inference speed, but also training speed.
20. og_kalu ◴[] No.42143381[source]
Tokenization is not strictly speaking necessary (you can train on bytes). What it is is really really efficient. Scaling is a challenge as is, bytes would just blow that up.
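
Back-of-envelope sketch of the blow-up, assuming ~4 characters per BPE token and self-attention that scales quadratically with sequence length (both are rough assumptions):

  # Back-of-envelope: the same text as BPE tokens vs. raw bytes.
  chars_per_token = 4            # rough average for English BPE vocabularies (assumption)
  text_chars = 32_000            # roughly a long article

  bpe_len  = text_chars // chars_per_token     # ~8,000 positions
  byte_len = text_chars                        # 32,000 positions

  # Self-attention cost grows roughly with the square of sequence length, so covering
  # the same text byte-by-byte costs about (byte_len / bpe_len)**2 = 16x more attention compute.
  print(bpe_len, byte_len, (byte_len / bpe_len) ** 2)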
21. ipsum2 ◴[] No.42143402{4}[source]
What you're saying is an explanation of what I said, but I agree with you ;)
replies(1): >>42148535 #
22. meroes ◴[] No.42143800[source]
At a certain level they are identical problems. My strongest piece of evidence is that I get paid as an RLHF'er to find ANY case of error, including "tokenization". Do you know how many errors an LLM makes on even the simplest grid puzzles - with CoT, with specialized models that don't try to "one-shot" problems, with multiple models, etc.?

My assumption is that these large companies wouldn't pay livable wages to hundreds of thousands of RLHF'ers through dozens of third-party companies if tokenization errors were just that.

replies(1): >>42149054 #
23. ATMLOTTOBEER ◴[] No.42144059[source]
I tend to agree with you. Your post reminded me of https://gwern.net/aunn
replies(1): >>42177854 #
24. joquarky ◴[] No.42144149[source]
It's not even possible to encode all forms of knowledge.
replies(1): >>42145431 #
25. numpad0 ◴[] No.42144207[source]
hot take: LLM tokens are kanji for AI, and just like kanji they work okay sometimes but fail miserably at the task of accurately representing English
replies(2): >>42148388 #>>42150181 #
26. Jensson ◴[] No.42144506{3}[source]
We know there are narrow solutions to these problems; the argument was never that the specific narrow task is impossible to solve.

The discussion is about general intelligence: the model fails at a task it is capable of simply because it chooses the wrong strategy, and that is a problem of lack of generalization, not a problem of tokenization. Being able to choose the right strategy is core to general intelligence; altering the input data to make it easier for the model to find the right solution to specific questions does not make it more general, it just shifts which narrow problems it is good at.

27. empiko ◴[] No.42144582[source]
I have seen a bunch of tokenization papers with various ideas but their results are mostly meh. I personally don't see anything principally wrong with current approaches. Having discrete symbols is how natural language works, and this might be an okayish approximation.
28. csomar ◴[] No.42144596[source]
It can count words in a paragraph though. So I do think it's tokenization.
29. malthaus ◴[] No.42144600[source]
https://youtu.be/zduSFxRajkE

Karpathy agrees with you; here he is hating on tokenizers while re-building them for 2 hours

30. shaky-carrousel ◴[] No.42145431{3}[source]
I know a joke where half of the joke is whistling and half gesturing, and the punchline is whistling. The wording is basically just to say who the players are.
31. azeirah ◴[] No.42145678{3}[source]
I strongly believe the issue isn't that tokenization isn't the underlying problem; it's that, let's say, bit-by-bit tokenization is too expensive to run at the scale things are currently being run at (OpenAI, Claude, etc.).
replies(1): >>42150150 #
32. blixt ◴[] No.42145725[source]
I think it's unfortunately infeasible to train on bytes, but yeah, it also seems very wrong to use a handwritten and ultimately human-designed vocabulary of tokens (if you take a look at the tokenizers out there you'll find fun things like regular expressions, chosen based on anecdotal evidence, that change what gets tokenized).

I keep thinking that if we can turn images into tokens, and we can turn audio into tokens, then surely we can create a set of tokens where the tokens are the model's own chosen representation for semantic (multimodal) meaning, and then decode those tokens back to text[1]. Obviously a big downside would be that the model can no longer 1:1 quote all text it's seen since the encoded tokens would need to be decoded back to text (which would be lossy).

[1] From what I could gather, this is exactly what OpenAI did with images in their gpt-4o report, check out "Explorations of capabilities": https://openai.com/index/hello-gpt-4o/
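
A minimal sketch of the kind of learned discrete "semantic token" bottleneck I mean - vector quantization in the spirit of VQ-VAE; purely illustrative, not any lab's actual implementation:

  # Illustrative vector-quantization step: map continuous encoder states to the ids of
  # their nearest entries in a learned codebook, so the "tokens" are model-chosen.
  import numpy as np

  rng = np.random.default_rng(0)
  codebook = rng.normal(size=(1024, 256))      # 1024 learnable "semantic tokens" of dim 256

  def quantize(h):
      """h: (seq_len, 256) encoder states -> discrete token ids."""
      d = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # squared distances
      return d.argmin(axis=1)                                     # nearest codebook entry

  h = rng.normal(size=(10, 256))               # stand-in for encoder output over 10 positions
  print(quantize(h))                           # ids a decoder would turn back into text (lossily)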

33. Hendrikto ◴[] No.42145899{4}[source]
No, actually a much smaller vocabulary: 256 tokens cover all bytes. See the ByT5 paper: https://arxiv.org/abs/2105.13626
replies(1): >>42145925 #
34. taeric ◴[] No.42145925{5}[source]
More tokens per sequence, though. And since it is learning sequences...
replies(1): >>42147484 #
35. PittleyDunkin ◴[] No.42146419[source]
A byte is itself sort of a token. So is a bit. It makes more sense to use more tokenizers in parallel than it does to try and invent an entirely new way of seeing the world.

Anyway humans have to tokenize, too. We don't perceive the world as a continuous blob either.

replies(1): >>42148069 #
36. PittleyDunkin ◴[] No.42146428[source]
I feel like we can set our qualifying standards higher than counting.
37. Anotheroneagain ◴[] No.42146444[source]
I think, on the contrary, that the more you can restrict it to reasonable inputs/outputs, the less powerful an LLM you are going to need.
38. pmarreck ◴[] No.42147347{3}[source]
My intuition says that tokenization is a factor, especially if it splits up individual move descriptions differently from other LLMs.

If you think about how our brains handle this input, they absolutely do not split a move between the letter and the number, although I would think the presence of the letter and number together would trigger the same two "tokens".

39. loa_in_ ◴[] No.42147484{6}[source]
Yeah, suddenly 16k tokens is just 16 kB of ASCII instead of ~6k words
40. samatman ◴[] No.42148069[source]
I would say that "humans have to tokenize" is almost precisely the opposite of how human intelligence works.

We build layered, non-nested gestalts out of real time analog inputs. As a small example, the meaning of a sentence said with the same precise rhythm and intonation can be meaningfully changed by a gesture made while saying it. That can't be tokenized, and that isn't what's happening.

replies(1): >>42149700 #
41. umanwizard ◴[] No.42148388[source]
Why couldn’t Chinese characters accurately represent English? Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).

If England had been in the Chinese sphere of influence rather than the Roman one, English would presumably be written with Chinese characters too. The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.

replies(2): >>42150302 #>>42150512 #
42. aithrowawaycomm ◴[] No.42148535{5}[source]
No, it's a rebuttal of what you said: CoT is not making up for a deficiency in tokenization, it's making up for a deficiency in transformers themselves. These complexity results have nothing to do with tokenization, or even LLMs, it is about the complexity class of problems that can be solved by transformers.
replies(1): >>42150513 #
43. 1propionyl ◴[] No.42149054{3}[source]
> hundreds of thousands of RLHF'ers through dozens of third party companies

Out of curiosity, what are these companies? And where do they operate?

I'm always interested in these sorts of "hidden" industries. See also: outsourced Facebook content moderation in Kenya.

replies(1): >>42159108 #
44. ajkjk ◴[] No.42149355[source]
This is probably unnecessary, but: I wish you wouldn't use the word "stupid" there. Even if you didn't mean anything by it personally, it might reinforce in an insecure reader the idea that, if one can't speak intelligently about some complex and abstruse subject that other people know about, there's something wrong with them, like they're "stupid" in some essential way. When in fact they would just be "ignorant" (of this particular subject). To be able to formulate those questions at all is clearly indicative of great intelligence.
replies(1): >>42150834 #
45. PittleyDunkin ◴[] No.42149700{3}[source]
What is a gestalt if not a token (or a token representing collections of other tokens)? It seems more reasonable (to me) to conclude that we have multiple contradictory tokenizers that we select from rather than to reject the concept entirely.

> That can't be tokenized

Oh ye of little imagination.

46. int_19h ◴[] No.42150150{4}[source]
It's not just a current thing, either. Tokenization basically lets you have a model with a larger input context than you'd otherwise have for the given resource constraints. So any gains from feeding the characters in directly have to be greater than this advantage. And for CoT especially - which we know produces significant improvements in most tasks - you want large context.
47. int_19h ◴[] No.42150181[source]
You could absolutely write a tokenizer that would consistently tokenize all distinct English words as distinct tokens, with a 1:1 mapping.

But AFAIK there's no evidence that this actually improves anything, and if you spend that much of the dictionary on one language, it comes at the cost of making the encoding for everything else much less efficient.
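
A toy sketch of such a 1:1 word-level scheme (ignoring punctuation, casing, and out-of-vocabulary words):

  # Toy word-level tokenizer with a 1:1 word <-> id mapping, built from a corpus.
  corpus = "the quick brown fox jumps over the lazy dog".split()

  vocab = {}                                    # word -> id, assigned in first-seen order
  for w in corpus:
      vocab.setdefault(w, len(vocab))
  inv = {i: w for w, i in vocab.items()}        # id -> word

  encode = lambda text: [vocab[w] for w in text.split()]
  decode = lambda ids: " ".join(inv[i] for i in ids)

  print(encode("the lazy fox"))                 # [0, 6, 3]
  print(decode(encode("the lazy fox")))         # "the lazy fox"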

replies(1): >>42155255 #
48. stickfigure ◴[] No.42150302{3}[source]
If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.

But the residents of England do in fact speak English, and English is a phonetic language, so there's an inherent impedance mismatch between Chinese characters and the English language. I can make up words in English and write them down which don't necessarily have Chinese written equivalents (and probably, vice-versa?).

replies(1): >>42150670 #
49. MacsHeadroom ◴[] No.42150368{4}[source]
This paper does not support your position any more than it supports the position that the problem is tokenization.

This paper posits that if the authors' intuition were true, then they would find certain empirical results - i.e. "If A then B." Then they test and find the empirical results. But this does not imply that their intuition was correct, just as "If A then B" does not imply "If B then A."

If the empirical results were due to tokenization absolutely nothing about this paper would change.

50. skissane ◴[] No.42150512{3}[source]
> Japanese and Korean aren’t related to Chinese and still were written with Chinese characters (still are in the case of Japanese).

The problem is – in writing Japanese with kanji, lots of somewhat arbitrary decisions had to be made. Which kanji to use for which native Japanese word? There isn't always an obviously best choice from first principles. But that's not a problem in practice, because a tradition developed of which kanji to use for which Japanese word (kun'yomi readings). For English, however, we don't have such a tradition. So it isn't clear which Chinese character to use for each English word. If two people tried to write English with Chinese characters independently, they'd likely make different character choices, and the mutual intelligibility might be poor.

Also, while neither Japanese nor Korean belongs to the same language family as Chinese, both borrowed lots of words from Chinese. In Japanese, a lot of use of kanji (especially on'yomi readings) is for borrowings from Chinese. Since English borrowed far fewer terms from Chinese, this other method of "deciding which character(s) to use" – look at the word's Chinese etymology – largely doesn't work for English given very few English words have Chinese etymology.

Finally, they also invented kanji in Japan for certain Japanese words – kokuji. The same thing happened for Korean Hanja (gukja), to a lesser degree. Vietnamese Chữ Nôm contains thousands of invented-in-Vietnam characters. Probably, if English had adopted Chinese writing, the same would have happened. But again, deciding when to do it and if so how is a somewhat arbitrary choice, which is impossible outside of a real societal tradition of doing it.

> The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.

Using the Latin alphabet changed English, just as using Chinese characters changed Japanese, Korean and Vietnamese. If English had used Chinese characters instead of the Latin alphabet, it would be a very different language today. Possibly not in grammar, but certainly in vocabulary.

51. ipsum2 ◴[] No.42150513{6}[source]
There's a really obvious way to test whether the strawberry issue is tokenization - replace each letter with a number, then ask chatGPT to count the number of 3s.

Count the number of 3s, only output a single number: 6 5 3 2 8 7 1 3 3 9.

ChatGPT: 3.

52. umanwizard ◴[] No.42150670{4}[source]
> If I read you correctly, you're saying "the fact that the residents of England speak English instead of Chinese is a historical accident" and maybe you're right.

That’s not what I mean at all. I mean even if spoken English were exactly the same as it is now, it could have been written with Chinese characters, and indeed would have been if England had been in the Chinese sphere of cultural influence when literacy developed there.

> English is a phonetic language

What does it mean to be a “phonetic language”? In what sense is English “more phonetic” than the Chinese languages?

> I can make up words in English and write them down which don’t necessarily have Chinese written equivalents

Of course. But if English were written with Chinese characters people would eventually agree on characters to write those words with, just like they did with all the native Japanese words that didn’t have Chinese equivalents but are nevertheless written with kanji.

Here is a famous article about how a Chinese-like writing system would work for English: https://www.zompist.com/yingzi/yingzi.htm

replies(2): >>42154464 #>>42155104 #
53. volkk ◴[] No.42150834[source]
> This is probably unnecessary

you're certainly right

replies(1): >>42175212 #
54. amelius ◴[] No.42151016[source]
Perhaps we can even do away with transformers and use a fully connected network. We can always prune the model later ...
55. TZubiri ◴[] No.42152806{6}[source]
Expensive computationally, in time, and yes, in money.

Worth noting that compute probably scales quadratically, cubically, or with some other polynomial in the number of tokens, so the computational cost of one character per token is probably huge compared to current tokenization.

56. stickfigure ◴[] No.42154464{5}[source]
> In what sense is English “more phonetic” than the Chinese languages?

Written English vs written Chinese.

How would you write, in Chinese, the words thingamajibber, gizmosity, or half the things that come out of AvE's mouth? These words have subtle, humorous, and entertaining meanings by way of twisting the sounds of other existing words. Shakespeare was a master of this kind of wordplay and invented a surprising number of words we use today.

I'm not saying you can't have the same phenomenon in spoken Chinese. But how do you write it down without a phonetic alphabet? And if you can't write it down, how do you share it to a wide audience?

replies(1): >>42154867 #
57. umanwizard ◴[] No.42154867{6}[source]
> How would you write, in Chinese, the words thingamajibber, gizmosity, or half the things that come out of AvE's mouth?

With Chinese characters, of course. Why wouldn’t you be able to?

In English “thing”, “a”, and “ma” are already words, and “jibber” would presumably be the first character in “gibberish”. So you could write that made-up word by combining those four characters.

> But how do you write it down without a phonetic alphabet?

In general to write a newly coined word you would repurpose characters that sound the same as the newly coined word.

Every syllable that can possibly be uttered according to Mandarin phonology is represented by some character (usually many), so this is always possible.

---

Regardless, to reiterate the original point: I'm not claiming Chinese characters are better or more flexible than alphabetic writing. They're not. I'm simply claiming that there's no inherent property of Japanese that makes it more amenable to representation with Chinese characters than English is (other than the fact that a lot of its vocabulary comes from Chinese, but that's not a real counterpoint given that there is lots of native, non-Chinese-derived vocabulary that's still written with kanji).

It would be possible to write Japanese entirely in the Latin alphabet, or English entirely with some system similar to Chinese characters, with minimal to no change to the structure of the language.

replies(2): >>42155215 #>>42162019 #
58. numpad0 ◴[] No.42155104{5}[source]
"Donald Trump" in CJK, taken from Wikipedia page URL and as I hear it - each are close enough[1] and natural enough in each respective languages but none of it are particularly useful for counting R in strawberry:

  C: 唐納·川普, "Thangnar Changpooh"  
  J: ドナルド・トランプ, "Donaludo Toranpu"  
  K: 도널드 트럼프, "D'neldeh Tlempeuh"  
> What does it mean to be a “phonetic language”?

Means the script is intended to record pronunciation rather than meaning; e.g. it's easy to see how "cow" is intended to be pronounced, but it's not necessarily clear what a cow is. Ideographic script, on the other hand, focuses on meaning; e.g. "魚" is supposed to look like a fish, but its pronunciation varies: "yueh", "sakana", "awe", etc.

1: I tried looking up other notable figures, but thought that this person's entertainment background illustrates the point more clearly.

59. numpad0 ◴[] No.42155215{7}[source]
> I'm simply claiming that there's no inherent property of Japanese that makes it more amenable to representation with Chinese characters than English is

What? No. Nothing but IPA (and even that only technically) and a language's own native writing works for recording pronunciation. Hiragana, Hangul, and Chữ Quốc Ngữ would not exist otherwise.

e: would _not_ exist

replies(1): >>42155340 #
60. numpad0 ◴[] No.42155255{3}[source]
I mean, it just feels to me that current LLMs must architecturally favor fixed-length "ideomes" - like phonemes, but for meaning - having been conceived under the influence of research on CJK.

And being architecturally based on such idea-level elements, I just casually thought there could be limits to how far it can be pushed toward perfecting English, and that some radical change - not simply dropping tokenization but something more fundamental - has to take place at some point.

replies(1): >>42159182 #
61. umanwizard ◴[] No.42155340{8}[source]
Then why are both English and Latin represented with Latin characters despite having a completely different phoneme inventory?
replies(1): >>42156141 #
62. numpad0 ◴[] No.42156141{9}[source]
Because one is a distant ancestor of the other...? It never adopted a writing system from outside. The written and spoken systems co-evolved from a clean slate.
replies(1): >>42158827 #
63. umanwizard ◴[] No.42158827{10}[source]
That’s not true. English is not a descendant of Latin, and the Latin alphabet was adopted from the outside, replacing Anglo-Saxon runes (also called the Futhorc script).

Just like kanji are not native to Japanese.

64. meroes ◴[] No.42159108{4}[source]
Scale AI is a big one; it owns companies that do this as well, such as Outlierai.

There are many other AI-trainer job companies, though. A lot of it is gig work, but the pay is more than for the vast majority of gig jobs.

65. int_19h ◴[] No.42159182{4}[source]
I don't think it's hard for the LLM to treat a sequence of two tokens as a semantically meaningful unit, though. They have to handle much more complicated dependencies to parse higher-level syntactic structures of the language.
66. stickfigure ◴[] No.42162019{7}[source]
> In English “thing”, “a”, and “ma” are already words, and “jibber” would presumably be the first character in “gibberish”. So you could write that made-up word by combining those four characters.

Nonsense. There is zero chance in hell that if you combine the pictographs for "thing", "a", "ma", and "gibberish", someone reading that is going to reproduce the sound "thingamajibber". It just does not work. The meme does not replicate.

There may be other virtues of pictographic written language, but reproducing sounds is not one of them. And - as any Shakespeare fan will tell you - tweaking the sounds of English cleverly is rather important. If you can't reproduce this behavior, you're losing something in translation. So to speak.

replies(1): >>42162107 #
67. umanwizard ◴[] No.42162107{8}[source]
Chinese characters aren't pictographs, so whether English could be written with pictographs is irrelevant to this discussion.

Each Chinese character represents a syllable (in Chinese languages) or a small set of possible sequences of syllables (in Japanese).

And yes, in Chinese languages, new words are created from characters that sound like the parts of the new word, all the time.

68. ajkjk ◴[] No.42175212{3}[source]
Well, I'm still glad I posted it, since I do care about it.
69. gwern ◴[] No.42177854[source]
One neat thing about the AUNN idea is that when you operate at the function level, you get sort of a neural net version of lazy evaluation; in this case, because you train at arbitrary indices in arbitrary datasets you define, you can do whatever you want with tokenization (as long as you keep it consistent and don't retrain the same index with different values). You can format your data in any way you want, as many times as you want, because you don't have to train on 'the whole thing', any more than you have to evaluate a whole data structure in Haskell; you can just pull the first _n_ elements of an infinite list, and that's fine.

So there is a natural way to not just use a minimal bit or byte level tokenization, but every tokenization simultaneously: simply define your dataset to be a bunch of datapoints which are 'start-of-data token, then the byte encoding of a datapoint followed by the BPE encoding of that followed by the WordPiece encoding followed by ... until the end-of-data token'.

You need not actually store any of this on disk, you can compute it on the fly. So you can start by training only on the byte encoded parts, and then gradually switch to training only on the BPE indices, and then gradually switch to the WordPiece, and so on over the course of training. At no point do you need to change the tokenization or tokenizer (as far as the AUNN knows) and you can always switch back and forth or introduce new vocabularies on the fly, or whatever you want. (This means you can do many crazy things if you want. You could turn all documents into screenshots or PDFs, and feed in image tokens once in a while. Or why not video narrations? All it does is take up virtual indices, you don't have to ever train on them...)
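
Concretely, the lazy layout could look something like the sketch below; `bpe_encode` and `wordpiece_encode` are hypothetical stand-ins for whatever real vocabularies you pick, and the indexing scheme is just one possible choice:

  # Sketch: one logical datapoint laid out as several encodings of the same text,
  # bracketed by sentinel tokens and generated lazily rather than stored on disk.
  SOD, EOD = -1, -2                 # start/end-of-data sentinels (arbitrary ids)

  def bpe_encode(text):             # hypothetical stand-in for a real BPE vocabulary
      return [hash(text[i:i + 3]) % 50_000 for i in range(0, len(text), 3)]

  def wordpiece_encode(text):       # hypothetical stand-in for a real WordPiece vocabulary
      return [hash(w) % 30_000 for w in text.split()]

  def virtual_tokens(text):
      """Yield (virtual_index, value) pairs; train on whichever indices the curriculum samples."""
      stream = [SOD, *text.encode("utf-8"), *bpe_encode(text), *wordpiece_encode(text), EOD]
      for i, v in enumerate(stream):
          yield i, v

  for i, v in virtual_tokens("Tokenization is not strictly necessary."):
      pass                          # e.g. weight byte indices early in training, BPE indices later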