
DeepSeek OCR (github.com)

990 points by pierre | 1 comment
1. schopra909:
It’s not clear to me what the bottleneck is for getting OCR to work “100%” with LLMs.

In my work we do a lot of stuff with image understanding and captioning (not OCR). There, object identification and description work great, since all the models use a CLIP-like visual backbone. But it falls apart when you ask about nuances like left/right relations or counting (reasoning improves the latter somewhat, but it’s too expensive to matter IMO).
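A minimal sketch of the kind of probe that exposes this, assuming a Hugging Face CLIP checkpoint; the image path and captions are illustrative, not from the comment above:

```python
# Probe a CLIP-style model with mirrored spatial captions and different counts.
# Checkpoint name and image path are assumptions for illustration only.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical image: a dog to the left of a cat
captions = [
    "a dog to the left of a cat",
    "a dog to the right of a cat",
    "two dogs",
    "three dogs",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

probs = logits.softmax(dim=-1).squeeze(0)
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.2f}  {caption}")
```

In practice, contrastive embeddings like CLIP’s often score the mirrored spatial captions and the different counts almost identically, which is the failure mode described above.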

For our tasks, it’s clear that more fundamental research on vision understanding is needed to push past CLIP. That would really improve LLMs for our use cases.

Curious if there’s something similar going on for OCR in the vision encoder that’s fundamentally holding it back.