
233 points JnBrymn | 2 comments
1. ianbutler No.45676963
https://arxiv.org/abs/2510.17800 (Glyph: Scaling Context Windows via Visual-Text Compression)

You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.

replies(1): >>45677374

2. scotty79 No.45677374
I couldn't imagine how rendering text tokens to images could bring any savings, but then I remembered that each token is converted into hundreds of floating-point numbers before being fed to the neural network. So in a way it's already rendered into a multidimensional pixel (or hundreds of arbitrary 2-dimensional pixels). This paper shows that you don't need that many numbers to keep the accuracy, and that using numbers that represent the text visually (which is pretty chaotic) works just as well as the way we currently do it.
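
Here's a rough back-of-envelope sketch in Python of where the savings could come from. Every constant below (characters per BPE token, rendered glyph size in pixels, ViT patch size) is an illustrative assumption, not a number from the paper:

    # Toy comparison: cost of a passage as text tokens vs. as
    # vision-encoder patches when the text is rendered to an image.
    # All constants are illustrative assumptions, not Glyph's settings.

    TEXT_TOKENS_PER_CHAR = 1 / 4        # rough BPE average: ~4 chars/token
    CHARS_PER_LINE = 80                 # assumed 80-column monospace render
    PX_PER_CHAR_W, PX_PER_CHAR_H = 4, 8 # assumed glyph size in pixels
    PATCH_SIZE = 16                     # typical ViT patch edge (16x16 px)

    def text_token_count(n_chars: int) -> float:
        # Tokens if the passage is fed to the model as plain text.
        return n_chars * TEXT_TOKENS_PER_CHAR

    def image_patch_count(n_chars: int) -> float:
        # Vision tokens if the same passage is rendered to an image
        # and cut into PATCH_SIZE x PATCH_SIZE pixel patches.
        n_lines = n_chars / CHARS_PER_LINE
        width_px = CHARS_PER_LINE * PX_PER_CHAR_W
        height_px = n_lines * PX_PER_CHAR_H
        return (width_px / PATCH_SIZE) * (height_px / PATCH_SIZE)

    n = 100_000  # a 100k-character document
    print(f"text tokens:   {text_token_count(n):,.0f}")   # 25,000
    print(f"image patches: {image_patch_count(n):,.0f}")  # 12,500

Under these toy numbers, a 16x16 patch covers about 8 characters versus roughly 4 per BPE token, so you get about a 2x reduction; the achievable ratio comes down entirely to how small you can render the text while the vision encoder can still read it.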