It seems to me, though, that if one is building a modern application that needs to get image segmentation and/or text recognition right, there are better APIs available than natural language? It seems like a lot of effort to build a production-scale CV application only to weigh it down with all of an LLM's shortcomings. Not a field I'm familiar with, but I would assume that this doesn't produce state-of-the-art results; that would change the analysis.
With this LLM approach you can at least create your training data from the raw images with natural language.
>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.
(I guess you could say a picture token is worth 10 textual tokens...)
Could someone explain to a noob what the information-theoretic intuition is here? Why does this work, is it that text tokens are still too "granular"/repetitive and don't come close to the ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way that arithmetic encoding does compared to huffman codes)?
And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.
Maybe they would render texts to an image before tokenizing to reduce the compute cost.
So I guess my question is: where is the juice being squeezed from? Why does the vision-token representation end up being more efficient than text tokens?
It will never be as precise as textual tokens but it can be really good as they show in the paper.
TLDR: It's MIT licensed
>先天下之忧而忧
How is this an example of a prompt? Google translated this to "Worry about the world first" while Bing says "Worry before the worries of the world."
Can anyone shed some light on this saying or why it's in the article?
Each vision token represents a 16x16 patch, but to fully cover a word you might need multiple vision tokens. So assuming that the embedding size of the vision token and text token is the same `d` (which I think has to be the case for multimodal models), then wouldn't the fair comparison be `x * d` elements for a sentence in terms of vision tokens, and `y * d` for the same sentence in terms of text tokens? I don't see how you could see a priori that x << y (especially by a factor of 10 as quoted in the paper).
That said, if I do experimentally try this by shrinking this very comment down to the smallest font size I can read it at, then seeing how many 16x16 tokens it takes, you can fit more text than I expected in each "vision token". So I can maybe buy that x is at least not greater than y. But it can't be as simple as "each vision token can cover more text", since that only enables better compression if the encoder can actually uncover some sort of redundancy within each token. (And presumably the type of redundancy it uncovers probably isn't something that "classical" compression techniques can exploit, otherwise it seems like it would have been tried by now?).
Neither translation catches the meaning well, though. It means: "worry before the rest of the world (notices that they have something to) worry about." The next part is 後天下之樂而樂 ("be happy only after the rest of the world is happy").
I don't know why it's a prompt example.
https://getomni.ai/blog/ocr-benchmark (Feb 2025)
Please note that LLMs have progressed at a rapid pace since February. We see much better results with the Qwen3-VL family, particularly Qwen3-VL-235B-A22B-Instruct for our use case.
But I think it's still experimental.
> 先天下之忧而忧,后天下之乐而乐
> (put the world's worries before yours, and put your happiness after the world's)

Edit: this translation is wrong, and raincole's is definitely better.
Since the model is a language model, they probably use this to demonstrate the model's language capabilities – the model should be able to complete the whole sentence pair. The paper also mentions this:
> To ensure the model’s language capabilities, we introduced 10% of in-house text-only pretrain data.
So I believe it is just a text-only demonstration.
后天下之乐而乐
which one is correct?
後天下之樂而樂
Which one is correct?
The OmniAI benchmark that's also referenced here wasn't updated with new models since February 2025. I assume that's because general purpose LLMs have gotten better at OCR than their own OCR product.
I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content in Markdown under some additional formatting instructions. That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.
I'd be interested to hear where OCR still struggles today.
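For reference, the kind of call described above looks roughly like this with the google-genai Python SDK (a sketch only; the model id, prompt, and setup are illustrative, and batch mode goes through a separate API):

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

page_bytes = open("page-001.png", "rb").read()
response = client.models.generate_content(
    model="gemini-2.5-flash-lite",  # the model mentioned above; exact id assumed
    contents=[
        types.Part.from_bytes(data=page_bytes, mime_type="image/png"),
        "Extract the full content of this page as Markdown. "
        "Preserve headings, lists and tables; do not add commentary.",
    ],
)
print(response.text)
```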
https://chatgpt.com/share/68f5f9ba-d448-8005-86d2-c3fbae028b...
Edit: Just caught a mistake, transcribed one of the prices incorrectly.
I don't have a use case where 100s or 1000s of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots and it has worked really well.
Curious to hear which OCR/LLM excels with these specific issues? Example complex table: https://cdn.aviation.bot/complex-tables.zip
I can only parse this table correctly by first parsing the table headers manually into HTML as example output. However, it still mixes up tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist
But that's something else, that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "Can take letters in images and make into digital text" to "Can replicate anything seen on a screen", the problem-space gets too big.
For those images you have, I'd use something like Magistral + Structured Outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data.
a) 后天下之乐而乐
b) 後天下之樂而樂
c) 後天下之楽而楽
a) is clearly Simplified Chinese from a sibling comment, b) is Traditional copied from your comment, and c) is as I just typed in my own language. Unicode Hanzi/Kanji are a mess, and there are characters that are the same or different, in appearance or in binary, depending on intended variants, languages, fonts, systems, keyboards, the distance between Earth and Alpha Centauri, etc. Or at least that kind of thing would motivate them to re-implement OCR with an LLM.
Just get a DEF 14A (Annual meeting) filing of a company from SEC EDGAR.
I have seen so many mistakes when looking at the result closely.
Here is a DEF 14A filing from Salesforce. You can print it to a PDF and then try converting it.
https://www.sec.gov/Archives/edgar/data/1108524/000110852425...
So you either have to be fine with a lot of uncertainty as to the accuracy of that interpretation or you have to wait for an LLM that can do it in a completely reproducible way every time.
Lots of words have multiple meanings and can mean different things even if used in the same sentence/context just from the interpretation of the person reading it.
Heck, I'd argue that most (not all) day-job conflicts come down to such differences in interpretation / miscommunication.
The way text tokenization works in LLMs is that you have a "lookup table" of (small) token ids to (large) vector embeddings. To pass text to the LLM, you split it at token boundaries, convert strings to token ids, and then construct the "context", a matrix where each row is a vector taken from that lookup table.
Transmitting text token sequences can be relatively efficient, you just transmit the token IDs themselves[1]. They're small integers (~100k possible token ids is typical for large models). Transmitting the actual embeddings matrix would be far less efficient, as embeddings often consist of thousands of floating point numbers.
Images are encoded differently. After some basic preprocessing, image data is passed straight to a neural-network-based image encoder. That encoder encodes the image into vectors, which are then appended to the context. There are no token ids and no lookup table; we go straight from image data to token embeddings.
This means transmitting image tokens cannot be done as efficiently, as you'd have to transmit the embeddings themselves. Even though an image is encoded in fewer tokens, the most efficient representation of those tokens takes more bytes.
You can think of a text token as an integer between 0 and n, which we know how to map to a vector. This means you have `n` possible choices of tokens. In contrast, an image token is an array of m floating point numbers (the vector itself), each of which can take on many possible values. This means the "token space" of vision tokens is actually much larger.
There's also the issue of patterns. Text tokens correspond directly to a contiguous span of UTF-8 bytes, and most tokenizers won't create tokens that span word boundaries. This means they can't encode global patterns efficiently. You can't have a "Hamlet's monologue" or "the text that follows is in Spanish" token.
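A toy PyTorch sketch of the difference described above (sizes are arbitrary, and the conv layer is just a stand-in for a real vision encoder):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100_000, 1024

# Text: small integer ids index into a learned lookup table
embedding_table = nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([[1011, 42, 7345]])        # cheap to transmit: just ints
text_embeds = embedding_table(token_ids)            # (1, 3, 1024)

# Image: pixels go straight through an encoder to embeddings; no ids, no lookup table
patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # stand-in for a ViT-style encoder
image = torch.randn(1, 3, 224, 224)
vision_embeds = patchify(image).flatten(2).transpose(1, 2)   # (1, 196, 1024)

# The model's context is just the concatenation of both kinds of embeddings
context = torch.cat([vision_embeds, text_embeds], dim=1)     # (1, 199, 1024)
```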
And general-purpose LLMs are heinous at OCR. If you are having success with Flash Lite, your documents must be incredibly simple.
There have been enormous advances in OCR over the past six months, so the SotA is a moving, rapidly advancing target.
"I'm studying at Oxford Univ" has basically no loss in meaning even though "University" was truncated to less than half its characters.
- the OmniAI benchmark is bad
- Instead check OmniDocBench[1] out
- Mistral OCR is far, far behind most open-source OCR models and even further behind Gemini
- End to End OCR is still extremely tricky
- composed pipelines work better (layout detection -> reading order -> OCR every element)
- complex table parsing is still extremely difficult
I use ocrmypdf (which uses Tesseract). Runs locally and is absolutely fantastic. https://ocrmypdf.readthedocs.io/en/latest/
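For anyone curious, basic usage via its Python API looks roughly like this (a sketch; ocrmypdf and a local Tesseract install are required, and the options shown are just common ones):

```python
import ocrmypdf  # pip install ocrmypdf; needs Tesseract installed

# Add a searchable text layer to a scanned PDF
ocrmypdf.ocr("scan.pdf", "scan_ocr.pdf", deskew=True, rotate_pages=True, language="eng")
```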
As per the blog post: >What does Anna’s Archive get out of it? Full-text search of the books for its users.
Do people usually recognize all variants as valid and legible? Or does any particular set of letters/symbols prevail in practice?
It's a hard (and very interesting) problem space.
If you tokenized at the character level ('a' -> embedding) then your vocabulary size would be small, but you'd have more tokens required to represent most content. (And context scales non-linearly, iirc, like n^3) This would also be a bit more 'fuzzy' in terms of teaching the LLM to understand what a specific token should 'mean'. The letter 'a' appears in a _lot_ of different words, and it's more ambiguous for the LLM.
On the flip side: what if you had one entry in the tokenizer's vocabulary for each word that existed? Well, it'd be far more than the ~100k entries used by popular LLMs, and that has computational tradeoffs: when you calculate the probability of the 'next' token via softmax, you'd have to run that over every entry in the vocabulary, and it increases the size of certain layers within the LLM (more memory + compute required for each token, basically).
Additionally, you run into a new problem: 'Rare Tokens'. Basically, if you have infinite tokens, you'll run into specific tokens that only appear a handful of times in the training data and the model is never able to fully imbue the tokens with enough meaning for them to _help_ the model during inference. (A specific example being somebody's username on the internet.)
Fun fact: These rare tokens, often called 'Glitch Tokens'[2], have been used for all sorts of shenanigans[3] as humans learn to break these models. (This is my interest in this as somebody who works in AI security)
As LLMs have improved, models have pushed towards the largest vocabulary they can get away with without hurting performance. This is about where my knowledge on the subject ends, but there have been many analyses done to try to compute the optimal vocabulary size. (See the links below)
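A quick way to see the granularity tradeoff in practice (assuming the tiktoken package; "cl100k_base" is just one common ~100k-entry vocabulary):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # ~100k-entry BPE vocabulary
text = "Antidisestablishmentarianism appears rarely in training data."

print(len(text))              # ~60 characters -> that many tokens with char-level tokenization
print(len(enc.encode(text)))  # far fewer subword tokens with a large vocabulary
print(enc.encode(text))       # the rare word is split into smaller, reusable pieces
```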
One area that I have been spending a lot of time thinking about is what Tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8) or directly against the final layers of state in a small LLM (trying to use a small LLM to 'grok' the meaning and hoist it into a more dense, almost compressed latent space that the large LLM can understand).
It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...
This immediately makes the model's inner state (even more) opaque to outside analysis though. e.g., like why using gRPC as the protocol for your JavaScript front-end sucks: Humans can't debug it anymore without other tooling. JSON is verbose as hell, but it's simple and I can debug my REST API with just network inspector. I don't need access to the underlying Protobuf files to understand what each byte means in my gRPC messages. That's a nice property to have when reviewing my ChatGPT logs too :P
Exciting times!
0: https://www.rohan-paul.com/p/tutorial-balancing-vocabulary-s...
1: https://arxiv.org/html/2407.13623v1
2: https://en.wikipedia.org/wiki/Glitch_token
3: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
As mentioned above, I have about 200 construction invoices. They are all formatted in a way that doesn't make sense. Most fail with both traditional OCR and OpenAI.
I'm not sure there's any information-theoretic intuition to be had with DeepSeek's experiments - it seems to be more about what's the lowest image resolution/grid you can get away with and still capture enough image detail to be able to accurately perform OCR on it.
It'd be cool if Karpathy would extend his NanoChat to be multi-modal to spread the knowledge of how this is typically done.
Disclaimer: I do not work for Tensorlake, but I know the folks behind it.
Take a lowercase a in English for example. This font writes it differently than a child. Or in cursive. Or probably than you would write it. But you recognize all of them and don’t really think about it.
Can you explain more about your setup? I have a quarter million pages I want to OCR.
Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.
It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents where wider context may be important. But saying it's "not OCR" doesn't seem meaningful from a technical perspective. It's an extension of the same goal to convert images of documents into the most accurate and useful digitized form with the least manual intervention.
As mentioned though, the LLMs are usually better at avoiding character substitutions, but worse at consistency across the entire page. (Just like a non-OCR LLM, they can and will go completely off the rails.)
Not to mention it's helpful to separate the two because there is such a big difference in the difficulty of the tasks.
Spreadsheets are not the biggest problem though, as they have a reliable 2-dimensional grid - at worst some cells will be combined. The form layouts and n-dimensional table structures you can find on medical and insurance documents are truly unhinged. I've seen documents that I struggled to interpret.
Edit: Oh I see the paper abstract says this explicitly: "In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G)". This is just part of the training data ingestion pipeline for their real models. Explains why the architecture is not using all of their latest tricks: it's already good enough for their use case and it's not the main focus.
> One area that I have been spending a lot of time thinking about is what Tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8)
I've had similar ideas in the past. High level languages that humans write are designed for humans. What does an "LLM native" programming language look like? And, to your point about protobufs vs JSON, how does a human debug it when the LLM gets stuck?
> It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...
That's basically the strategy for Claude's new "Skills" feature, just in a more dynamic/AI driven way. Claude will do semantic search through YAML frontmatter to determine what skill might be useful in a given context, then load that entire skill file into context to execute it. Your idea here is similar, use a small local model to summarize each file (basically dynamically generate that YAML front matter), feed those into the larger model's context, and then it can choose which file(s) it cares about based on that.
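A hypothetical sketch of that "local vibe summarizer" idea, assuming a local Ollama server with some small model pulled (this is not how Claude Code or Skills actually work; the model name and prompt are made up):

```python
import pathlib
import requests

def vibe_of(path: pathlib.Path) -> str:
    """Ask a small local model for a two-sentence summary of one file."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5:3b",   # assumption: any small local model will do
            "prompt": f"In two sentences, what does this file do?\n\n{path.read_text()[:8000]}",
            "stream": False,
        },
        timeout=120,
    )
    return resp.json()["response"]

# These summaries are what you'd append to the large model's context
summaries = {str(p): vibe_of(p) for p in pathlib.Path("src").rglob("*.py")}
```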
Fixed layout and lack of semantic structure in PDFs.
Non-linear text flow due to columns, sidebars, or images.
Position-based text without contextual or relational markers.
Absence of standard structure tags (like in HTML).
Scanned or image-based PDFs requiring OCR.
Preprocessing needs for scanned PDFs (noise, rotation, skew).
Extracting tables from unstructured or visually complex layouts.
Multi-column and fancy layouts breaking semantic text order.
Background images and watermarks interfering with text extraction.
Handwritten text recognition challenges.
[1] https://unstract.com/blog/pdf-hell-and-practical-rag-applica...
> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions. https://arxiv.org/abs/2403.05525 DeepSeek-VL paper
It's very hard to guess from the github and paper. For example, there is OCR in the title but the abstract and readme.md talk about context compression for LLMs, which I find confusing. Somebody care to explain the link and provide some high-level context?
So to me it’s not a surprise that you can transform the two-dimensional representation of the same information into concepts again without losing much.
The paper talks about using this approach to generate large amounts of LLM training data rapidly. That’s intriguing. It suggests that one of the best ways of training models on a wide variety of input data with very long context is to provide it with an image representation instead of text tokens.
Our solution so far has been to stick with Tesseract plus good clean-up routines, and then augment/fix up the output using the VLM OCR text where we don't have structured source document data available.
It could be that we just have a very niche use case and it doesn't matter to most people. I'm sure that if you just want a text dump or a restructured markdown/HTML representation of documents these VLMs work well, but the number of articles and comments I've seen claiming that these models have 'solved' OCR just seems counter to our experience.
In my work we do a lot of stuff with image understanding and captioning (not OCR). There, object identification and description work great, since all the models are using a CLIP-like visual backbone. But it falls apart when you ask about nuances like left/right or counting (reasoning kind of improves the latter, but it's too expensive to matter IMO).
For our tasks, it’s clear that there’s more fundamental research that needs to be done on vision understanding to push past CLIP. That would really improve LLMs for our usecases.
Curious if there’s something similar going on for OCR in the vision encoder that’s fundamentally holding it back.
The fuss around old fashioned OCR seemed strange to me initially considering the above, but I selfishly forgot to consider addressing compute/offline requirements.
It would also be nice for there to be a good competitor.
My main question really is: what are practical OCR tools that I can string together on my MacBook Pro M1 Max w/ 64GB Ram to maximize OCR quality for lots of mail and schoolwork coming into my house, all mostly in English.
I use ScanSnap Manager with its built in OCR tools, but that's probably super outdated by now. Apple Vision does way better job than that. I heard people say also that Apple Vision is better than Tesseract. But is there something better still that's also practical to run in a scripted environment on my machine?
It's a common pastime for programmers to claim that our textual programming languages are just terrible and need to be replaced somehow with something visual, but I think this very often comes from a place of not understanding just how amazing textual languages are. Not that they couldn't possibly be improved by something in at least some domains, and there are after all some successful niches for visual languages, but I think if you set out to wholesale replace textual languages without an understanding of and appreciation for the impressive nature of the competition they offer, you're setting yourself up to fail.
Blotchy text and a specific typeface can make 6's look like 8's to the non-discerning eye; a human would think it's an 8, then zoom in and see it's a 6.
Google's image quality on uploads is still streets ahead of OpenAI's, for instance, btw.
To solve this generally you need to chunk not by page, but by semantic chunks that don't exceed the information density threshold of the model, given the task.
This is not a trivial problem at all. And sometimes there is no naive way to chunk documents so that every element can fit within the information density limit. A really simple example is a table that spans hundreds of pages. Solving that generally is an open problem.
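To make the "density threshold" idea concrete, here's a naive greedy packer (a sketch with a crude chars-per-token estimate). It's also exactly the kind of approach that breaks on a table spanning hundreds of pages, since no chunk boundary inside the table is safe:

```python
def chunk_elements(elements, max_tokens=2000):
    """Greedily pack extracted layout elements (strings) into chunks under a token budget."""
    chunks, current, used = [], [], 0
    for element in elements:
        estimate = max(1, len(element) // 4)   # crude ~4 chars/token heuristic
        if current and used + estimate > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(element)
        used += estimate
    if current:
        chunks.append(current)
    return chunks
```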
To a pure LLM, characters 15 and 16 at line 1 are considered adjacent, but there's no relationship between character 15 of line 1 and character 15 of line 2.
For a vision model (which considers text as squiggles, not UTF-8 codepoints), such a relationship does exist.
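A toy illustration of that difference (glyph and patch sizes are made-up numbers):

```python
lines = ["The quick brown fox jumps over",
         "the lazy dog by the riverbank."]

# 1-D text stream: the character directly *below* position 15 is far away
i_top = 14                          # character 15 of line 1 (0-indexed)
i_bottom = len(lines[0]) + 1 + 14   # character 15 of line 2, past the newline
print(i_bottom - i_top)             # 31 positions apart in the stream

# 2-D image: with ~8 px wide glyphs and 16x16 patches, the two characters share
# a patch column and sit in vertically adjacent patch rows
glyph_width, patch_size = 8, 16
print((14 * glyph_width) // patch_size)   # same patch-column index on both lines
```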
Take an OCR model with 99.9% character-wise accuracy. Sounds pretty good, right? Well if your use case is, say, digitalizing old printed novels, then yeah it's probably good enough.
But what if your documents are personal records with millions of names, to insert in some administrative database? Now 1 out of 1000 persons will have their name misspelled. Ooops.
This is indeed amazing. It's actually how humans try to understand and remember things. VISUALLY! And when memory fades, the images get blurry.
Not sure if those closed-source multimodal models are already using this method.
(I imagine it also had the benefit of reducing fraud/errors).
In this day and age, it's probably easier/better to change the process around that as there's little excuse for such shit quality input. I understand this isn't always possible though.
My point is, I don’t think any comparative benchmark would ever exclude something based on “oh it’s just 10%, who cares.” I think the issue is more that Apple Vision Framework is not well known as an OCR option, but maybe it’s starting to change.
And another part of the irony is that Apple’s framework probably gets way more real world usage in practice than most of the tools in that benchmark.
The number of bits to represent a text or vision token is the same, since they are both represented as embeddings of a fixed number of dimensions defined by the Transformer (maybe a few thousand for a large SOTA model).
Whether a vision token actually contains enough information to accurately extract (OCR) all the text data from that portion of the image is going to depend on how many pixels that vision token represents and how many words were present in that area of the image. It's just like considering images of the same page of text at different resolutions - a 1024x1024 image vs a 64x64 one, etc. As the resolution decreases so will OCR accuracy. At some point the resolution is insufficient and the words become a blurry mess and OCR accuracy suffers.
This is what DeepSeek are reporting - OCR accuracy if you try to use a single vision token to represent, say, 10 text tokens, vs 20 text tokens. The vision token may have enough resolution to represent 10 tokens well, but not enough for 20.
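Rough back-of-the-envelope numbers for that intuition (all values illustrative, not the paper's exact configuration):

```python
# A 1024x1024 page image split into 16x16 patches
raw_patches = (1024 // 16) * (1024 // 16)   # 4096 patches

# Assume the vision encoder downsamples patches by a further 16x before the decoder
vision_tokens = raw_patches // 16           # 256 vision tokens

# A dense page of text might be around 2500 text tokens
text_tokens = 2500
print(text_tokens / vision_tokens)          # ~9.8, i.e. roughly 10x "optical compression"
```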
So I'm not saying it should be excluded because it can only be used by relatively few people, but I was trying to communicate that I kind of get why not so many people care about it and why it gets forgotten, since most people wouldn't be able to run it even if they wanted to.
Instead, something like DeepSeek OCR could be deployed on any of the three major OSes (assuming there are implementations of the architecture available), so of course it gets a lot more attention and will be included in way more benchmarks.
So close but it should be 2025/X/XX as "X" = 10 in Roman Numerals /s
Jokes aside, this is really neat and I'm looking forward to getting this running. For most OCR-type stuff I just use AWS Textract since I need it so rarely and that service does a decent job. I really like how well this model seems to extract images/figures as well from the original document.
https://grok.com/share/bGVnYWN5LWNvcHk%3D_572b4955-6265-4210...
Here's a result I got https://github.com/simonw/research/blob/main/deepseek-ocr-nv... - against this image: https://static.simonwillison.net/static/2025/ft.jpeg
How do you get the "as root" part of that to work?
(sorry if it's explained in your article)
Also, Simon Willison made a blog post that might be helpful: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
I hope that this capability improves so I can use only Gemini API.
And yes, Ansible (among other tools) can be used to manage macOS.
This discussion doesn’t seem productive. You have a preconceived view point, and you’re not actually considering the problem or even doing 5 seconds of googling.
Managing a Mac fleet on AWS isn’t a real problem. If Apple’s OCR framework were significantly above the competition, it could easily be used. I would like to see benchmarks of it, as the other person was also asking for.
It does not provide bounding boxes but you can get text.
E.g.
Most current benchmarks have a scoring scheme of

1 - Correct answer
0 - No answer or incorrect answer

But what they need is something more like

1 - Correct answer
0.25 - No answer
0 - Incorrect answer
You need benchmarks (particularly those used in training) to incentivize the models to acknowledge when they're uncertain.
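A sketch of that scoring function (names are placeholders):

```python
def score(prediction, truth):
    if prediction is None:      # the model explicitly abstained / said "I don't know"
        return 0.25
    return 1.0 if prediction == truth else 0.0   # wrong answers earn nothing
```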
But under the hood, the image will have to be transformed into features / embeddings before it can be decoded into text. Suppose that the image gets processed into 100 “image tokens”, which are subsequently decoded into 1000 “text tokens”.
Now forget that we are even talking about images or OCR. If you look at just the decoding process, you find that we were able to compress the output into a 10x smaller representation.
The implication for LLMs is that we don’t need 1000 tokens and 1000 token embeddings to produce the 1001st token, if we can figure out how to compress them into a 10x smaller latent representation first.
Benchmark author here. No, just pivoted away from OCR API as a product! Still use our API internally but have been lazy about updating benchmarks.
Gemini is definitely the best model for OCR. But it has a really high rate of "recitation" errors, where it determines the output is too close to its training data and cuts it off. Something like 10% of the time from our testing. It also has this hilarious hallucination where, if you have a blank page in the document mix, it just makes up new info.
OpenAI is OK. GPT-5 wasn't any better than 4o or 4.1. Main issues were: dropping content like headers/footers, losing its mind on sideways pages, and frequently refusing to read things like ID documents, health care forms, or anything it judges to have too much PII.
But not for semantic document structure — recognizing that the grammatically incomplete phrase in a larger font is a heading, recognizing subheadings and bullet lists, tables, etc.
Also not for handwritten text, text inside of images (signage and so forth), or damaged source material (old photocopies and scans created in the old days).
Those areas all seem to me where an LLM-based approach could narrow the gap between machine recognition and humans. You have to sort of reason about it from the context as a human to figure it out, too.
Whereas LLMs make the opposite trade-off. There are information-theoretic limits on the amount of information LLMs can store (roughly 3.6 bits per parameter), so they aggressively compress information and trade away nuance (https://arxiv.org/abs/2505.17117).
b) Traditional Chinese
c) 楽 is a variation of 樂, which is now widely used in Japanese Kanji but deprecated in Traditional Chinese.
Note:
A variation means some people wrote 樂 as 楽 in ancient China, but it was not widely adopted.
Kanji is a Japanese word meaning "Chinese character".
Macau, HK and Taiwan use traditional Chinese characters.
Mainland China, Singapore and Malaysia use simplified Chinese characters.
Japan uses its own version, some simplified, some traditional, and also invented over 100 Japanese-made kanji following the same logic by which Chinese characters are formed.
As a matter of fact, the simplification of Chinese characters started when the KMT/Republic of China was in control of the whole of China. Politics got in the way later: the RoC stopped the simplification process while the PRC kept it going. Macau & HK were not involved since the Portuguese and British colonial governments didn't care. Singapore and Malaysia picked the simplified version out of convenience.
Ownership laundering.
Correct?
We tend to think in images rather than plaintext, and here we are discovering it's more efficient for a computer to do so as well.
Another takeaway is that you don’t need to pass a tensor of shape (batch_size, sequence_length, d_model) through your transformer. Not every token needs its own dedicated latent embedding. You can presumably get away with dividing sequence_length by a constant.
This isn’t super ground breaking but it does reinforce the validity of a middle ground between recurrent models, where context is compressed into a single “memory token”, and transformers, where context is uncompressed. 1 < n/k < n
I’ve been exploring Git activity analysis recently and ran into similar trade-offs: how do you tokenize real-world code and avoid counting noise?
This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence processing model is an active area of research. Eg, one neat paper exploring this direction is: "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.
>Image-patch tokens make better use of the high-dimensional embedding space than text tokens do.
That seems to imply it's not necessarily something unique about images, just a byproduct of having better conversion from "raw input -> embeddings" [2]. Although there is a certain elegance of handling both images and text with the same method.
[1] https://twitter.com/c0mbinat0r/status/1980698103234891892 [2] https://twitter.com/Kangwook_Lee/status/1980709454522744902
Already at move 3 you have billions of possible positions.