
296 points by todsacerdoti | 1 comment | source
andy99 ◴[] No.44368430[source]
> inability to detect the number of r's in :strawberry: meme

Can someone (who knows about LLMs) explain why the r's-in-strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each letter were one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less informed people, and it got picked up?
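The tokenization point can be made concrete with a toy sketch. The vocabulary, merges, and IDs below are made up for illustration (real BPE vocabularies are learned from data), but the mechanism is the same: the model receives opaque integer IDs for multi-character chunks, so the per-letter structure of "strawberry" is never directly in its input.

```python
# Toy greedy longest-match tokenizer (hypothetical vocabulary, not a real one).
toy_vocab = {"str": 401, "aw": 402, "berry": 403}  # made-up token IDs

def toy_tokenize(word):
    # Greedily match the longest known piece at each position,
    # falling back to single characters when nothing matches.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in toy_vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

word = "strawberry"
print(toy_tokenize(word))   # ['str', 'aw', 'berry'] — 3 chunks, letters hidden
print(word.count("r"))      # 3 — trivial when you can see characters
```

A character-level model would at least receive one input position per letter; whether that alone makes counting easy is exactly the empirical question raised below.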

replies(7): >>44368463 #>>44369041 #>>44369608 #>>44370115 #>>44370128 #>>44374874 #>>44395946 #
krackers ◴[] No.44369608[source]
Until I see evidence that an LLM trained at e.g. the character level _CAN_ successfully "count Rs", I don't trust this explanation over any other hypothesis. I'm not familiar with the literature, so I don't know if this has been done, but I couldn't find anything with a quick search. Surely if someone did it successfully they would have published it.
replies(3): >>44369975 #>>44371050 #>>44372266 #
1. anonymoushn ◴[] No.44372266[source]
There are various papers about this, perhaps most prominently the Byte Latent Transformer.