The part on tokenisation is not very convincing. Replacing BPE with characters or even bytes will not "remove tokenisation" -- atoms will still be tokens, relating to different things in different cultures/writing traditions (a "Chinese byte" is a part of a Chinese character; an "English byte" is basicaly a letter or a number) and not relating to something fundamentally linguistic. BPE can be thought of as another way of representing linguistic sequences with symbols of some kind; it provides less inductive bias into the use of language, but it is not perhaps categorically different from any kind of writing.
replies(1):