
DeepSeek OCR (github.com)

990 points by pierre
pietz ◴[] No.45641449[source]
My impression is that OCR is basically solved at this point.

The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.

I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content as Markdown, with some additional formatting instructions. That will cost you around $0.20 per 1000 pages in batch mode, and the results have been great.
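For the curious, the per-page loop is only a few lines. A minimal sketch, assuming the google-genai Python SDK; the prompt, filenames, and output path are illustrative, not anything official:

```python
# Per-page OCR loop with Gemini, assuming the google-genai SDK
# (pip install google-genai). Filenames and prompt wording are illustrative.
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

PROMPT = (
    "Extract all text content from this page as Markdown. "
    "Preserve headings, lists, and tables; do not add commentary."
)

markdown_pages = []
for page in sorted(Path("pages").glob("*.png")):
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=[
            types.Part.from_bytes(data=page.read_bytes(), mime_type="image/png"),
            PROMPT,
        ],
    )
    markdown_pages.append(response.text)

Path("output.md").write_text("\n\n".join(markdown_pages))
```

The ~$0.20 per 1000 pages figure comes from batch mode, which goes through Gemini's separate batch API at a discount; the synchronous loop above is just the simplest version of the idea.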

I'd be interested to hear where OCR still struggles today.

replies(23): >>45641470 #>>45641479 #>>45641533 #>>45641536 #>>45641612 #>>45641806 #>>45641890 #>>45641904 #>>45642270 #>>45642699 #>>45642756 #>>45643016 #>>45643911 #>>45643964 #>>45644404 #>>45644848 #>>45645032 #>>45645325 #>>45646756 #>>45647189 #>>45647776 #>>45650079 #>>45651460 #
cahaya ◴[] No.45641806[source]
Lots of OCR systems and LLMs (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML: tables with multiple header rows and merged cells get mixed up, multiple columns of tick boxes get confused, and multi-page tables aren't understood correctly. LlamaIndex also fails miserably on these.

Curious to hear which OCR/LLM excels at these specific issues. Example complex table: https://cdn.aviation.bot/complex-tables.zip

I can only get this table parsed correctly by first transcribing the table headers into HTML by hand and supplying them as example output. Even then, it still mixes up the tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist
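Concretely, the workaround looks roughly like this: hand the model the header structure you transcribed and ask it only to fill in the body rows. A minimal sketch, assuming the google-genai Python SDK; the header skeleton below is illustrative, not the actual EASA layout:

```python
# Table extraction with a hand-transcribed header skeleton, assuming the
# google-genai SDK. Skeleton and filenames are illustrative.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Hand-transcribed header skeleton: giving the model the structure up front
# is what makes the extraction stick.
HEADER_SKELETON = """
<table>
  <thead>
    <tr><th rowspan="2">Reference</th><th colspan="3">Compliance</th></tr>
    <tr><th>Yes</th><th>No</th><th>Partial</th></tr>
  </thead>
  <tbody>
    <!-- one <tr> per body row; render tick boxes as "x" or "" -->
  </tbody>
</table>
"""

with open("complex-table-page1.png", "rb") as f:
    page = f.read()

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_bytes(data=page, mime_type="image/png"),
        "Transcribe the table in this image into the HTML skeleton below. "
        "Keep the <thead> exactly as given and only add <tbody> rows.\n"
        + HEADER_SKELETON,
    ],
)
print(response.text)
```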

replies(3): >>45641970 #>>45642001 #>>45645000 #
CaptainOfCoit ◴[] No.45642001[source]
> Lots of OCR systems and LLMs (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML

But that's something else; that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "can take letters in an image and turn them into digital text" to "can replicate anything seen on a screen", the problem space gets too big.

For those images you have, I'd use something like Magistral + Structured Outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data.
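A sketch of that two-pass idea, assuming the mistralai Python SDK's structured-output support (chat.parse with a Pydantic model) and a vision-capable model; the model name, prompts, and filenames are all illustrative:

```python
# Two-pass table extraction sketch, assuming the mistralai SDK
# (pip install mistralai). Model name and prompts are illustrative.
import base64

from mistralai import Mistral
from pydantic import BaseModel, create_model

client = Mistral(api_key="...")
MODEL = "magistral-medium-latest"  # assumption: a vision-capable model

with open("complex-table.png", "rb") as f:
    IMAGE_URL = "data:image/png;base64," + base64.b64encode(f.read()).decode()

def ask(prompt: str, schema: type[BaseModel]) -> BaseModel:
    """One structured-output call with the table image attached."""
    resp = client.chat.parse(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": IMAGE_URL},
            ],
        }],
        response_format=schema,
    )
    return resp.choices[0].message.parsed

# Pass 1: discover the structure (flattened column names).
class TableShape(BaseModel):
    columns: list[str]

shape = ask(
    "List the flattened column names of this table, merging "
    "multi-row headers into single names.",
    TableShape,
)

# Pass 2: build a row model from the discovered columns and extract the data.
# (Column names must be valid Python identifiers; sanitize them in practice.)
Row = create_model("Row", **{c: (str, ...) for c in shape.columns})
Table = create_model("Table", rows=(list[Row], ...))

table = ask(
    "Extract every body row of this table. Render tick boxes as 'x' or ''.",
    Table,
)
for row in table.rows:
    print(row.model_dump())
```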

replies(3): >>45642072 #>>45643869 #>>45643966 #
kmacdough ◴[] No.45643966[source]
> But that's something else; that's no longer just OCR ("Optical Character Recognition").

Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.

It is a fair question whether the OCR-inspired approach is the right one for more complex structured documents where wider context may be important. But saying it's "not OCR" doesn't seem meaningful from a technical perspective: it's an extension of the same goal of converting document images into the most accurate and useful digitized form with the least manual intervention.

replies(1): >>45644127 #
CaptainOfCoit ◴[] No.45644127[source]
Personally I think there's a meaningful distinction between "can extract text" vs. "can extract text and structure". It's true that some OCR systems try to replicate structure as well, but even today I think that's the exception, not the norm.

Not to mention it's helpful to separate the two because there is such a big difference in the difficulty of the tasks.