
64 points jxmorris12 | 1 comment
macleginn No.44504019
The part on tokenisation is not very convincing. Replacing BPE with characters or even bytes will not "remove tokenisation" -- the atoms will still be tokens, relating to different things in different cultures and writing traditions (a "Chinese byte" is a fragment of a Chinese character; an "English byte" is basically a letter or a digit) rather than to anything fundamentally linguistic. BPE can be thought of as just another way of representing linguistic sequences with symbols of some kind; it builds in less inductive bias about how language is used, but it is not categorically different from any other kind of writing.
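To make that concrete, here is a toy sketch (my own example strings, using Python's UTF-8 encoding as the stand-in "byte-level tokenizer") of how unevenly byte atoms map onto different writing systems:

    # Byte-level "tokenization" is just UTF-8 encoding, and the atoms it
    # produces mean different things depending on the script.
    english = "language model"
    chinese = "语言模型"  # "language model" in Chinese

    for text in (english, chinese):
        byte_tokens = list(text.encode("utf-8"))
        print(text, len(text), "chars ->", len(byte_tokens), "byte tokens")

    # English: roughly 1 byte per letter, so a byte is close to a character.
    # Chinese: 3 bytes per character, so a single byte token is only a
    # fragment of a character, not anything like a linguistic unit.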
replies(1): >>44506672 #
1. aabhay No.44506672
The point is not that tokenization is irrelevant; it's that the transformer model _requires_ information-dense inputs, which is what you get by compressing the input space from raw characters to subwords. Give it something like raw audio or video frames and its capabilities dramatically bottom out. That's why even today's SOTA transformer models heavily preprocess media input, even going as far as lightweight frame importance sampling to extract the "best" parts of a video.
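As a rough sketch of what that kind of preprocessing looks like (the function name and the change-based scoring below are my own illustration, not any particular model's pipeline):

    # Toy "frame importance sampling": keep the k frames that differ most
    # from their predecessor, so the model sees a denser, less redundant input.
    import numpy as np

    def sample_important_frames(frames: np.ndarray, k: int) -> np.ndarray:
        # frames: (num_frames, H, W, C) array of raw video frames
        diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
        scores = np.concatenate([[0.0], diffs])    # first frame gets a neutral score
        keep = np.sort(np.argsort(scores)[-k:])    # top-k by change, kept in temporal order
        return frames[keep]

    video = np.random.randint(0, 256, size=(120, 64, 64, 3), dtype=np.uint8)
    clip = sample_important_frames(video, k=16)
    print(clip.shape)  # (16, 64, 64, 3)

Real pipelines presumably use fancier scorers (learned relevance, scene cuts, etc.), but the shape of the trick is the same: throw away most of the raw input before the model ever sees it.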

In the future, all of these tricks may seem quaint. “Why don’t you just pass the raw bits of the camera feed straight to the model layers?” we may say.