296 points by todsacerdoti | 1 comment
andy99 ◴[] No.44368430[source]
> inability to detect the number of r's in :strawberry: meme

Can someone (who knows about LLMs) explain why the r's-in-strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each letter were one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things a misinformed person said to sound smart to even-less-informed people, and that got picked up?
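
For concreteness, here's a minimal sketch of what the model actually receives, assuming OpenAI's tiktoken library and the cl100k_base encoding (both my choice for illustration; the exact split depends on the vocabulary):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    pieces = [enc.decode([i]) for i in ids]
    print(ids)     # a short list of arbitrary integer IDs
    print(pieces)  # multi-character chunks, e.g. something like
                   # 'str' + 'aw' + 'berry' (vocabulary-dependent)
    # The model only ever sees the integer IDs; an individual letter
    # never appears as an input unit unless it happens to be its own token.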

replies(7): >>44368463 #>>44369041 #>>44369608 #>>44370115 #>>44370128 #>>44374874 #>>44395946 #
yuvalpinter ◴[] No.44395946[source]
We have a paper under review that's gonna be up on arXiv soon, where we test this for ~10,000 words and find a consistent decline in counting ability based on how many characters are in the tokens where the target character appears. It seems that models know which character a single-character token is, but really don't get much about the inner composition of multi-character tokens.
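
This isn't the paper's code, but a sketch of the kind of per-word measurement described, again assuming tiktoken with cl100k_base (token_lengths_containing is an illustrative name, not anything from the paper):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def token_lengths_containing(word: str, ch: str) -> list[int]:
        # Lengths of the token pieces in which the target character appears.
        pieces = [enc.decode([i]) for i in enc.encode(word)]
        return [len(p) for p in pieces if ch in p]

    # Bin a counting benchmark's accuracy by these lengths; the reported
    # finding is that accuracy declines as the containing token gets longer.
    print(token_lengths_containing("strawberry", "r"))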
replies(1): >>44397892 #
1. hnaccount_rng ◴[] No.44397892[source]
Isn't that a rather trivial result? Or at least expected? Unless you manually encode the "this token consists of these characters" information, those are completely independent things for the model?
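
To make that independence concrete, a small sketch (same assumed tokenizer as above) showing that near-identical surface strings map to unrelated IDs:

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for w in ["berry", " berry", "Berry", "berries"]:
        print(repr(w), enc.encode(w))
    # The IDs are arbitrary indices into an embedding table; nothing in
    # them encodes the shared characters, so any knowledge of a token's
    # spelling has to be learned indirectly from training data.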