
DeepSeek OCR

(github.com)
990 points pierre | 2 comments
pietz ◴[] No.45641449[source]
My impression is that OCR is basically solved at this point.

The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.

I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content as Markdown, with some additional formatting instructions. That costs around $0.20 per 1000 pages in batch mode, and the results have been great.
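
For anyone who wants to try the per-page approach, here is a minimal sketch of what one request body could look like. It assumes the JSON shape of the public Gemini REST `generateContent` call; the model id string, the prompt wording, and the helper names `build_request` and `estimate_cost` are illustrative assumptions, not details from the comment. Nothing is sent over the network.

```python
import base64
import json

# Assumed model id and prompt wording; neither is taken from the comment above.
MODEL = "gemini-2.5-flash-lite"
PROMPT = (
    "Extract the content of this page as Markdown. "
    "Keep headings, lists, and tables; do not add commentary."
)


def build_request(page_png: bytes, prompt: str = PROMPT) -> dict:
    """Build a generateContent-style JSON body for a single page image."""
    return {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {
                    "mime_type": "image/png",
                    "data": base64.b64encode(page_png).decode("ascii"),
                }},
            ]
        }]
    }


def estimate_cost(pages: int, usd_per_1000_pages: float = 0.20) -> float:
    """Scale the quoted figure of roughly $0.20 per 1000 pages to a page count."""
    return pages * usd_per_1000_pages / 1000


# Placeholder bytes stand in for a real rendered page.
fake_page = b"\x89PNG placeholder bytes"
body = build_request(fake_page)
print(json.dumps(body)[:80])
print(estimate_cost(25_000))
```

One image per request keeps failures isolated to a single page, which makes it easy to retry only the pages that came back malformed.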

I'd be interested to hear where OCR still struggles today.

replies(23): >>45641470 #>>45641479 #>>45641533 #>>45641536 #>>45641612 #>>45641806 #>>45641890 #>>45641904 #>>45642270 #>>45642699 #>>45642756 #>>45643016 #>>45643911 #>>45643964 #>>45644404 #>>45644848 #>>45645032 #>>45645325 #>>45646756 #>>45647189 #>>45647776 #>>45650079 #>>45651460 #
kbumsik ◴[] No.45641470[source]
> My impression is that OCR is basically solved at this point.

Not really, in my experience. Models still struggle, especially with table format detection.

replies(2): >>45641501 #>>45643548 #
coulix ◴[] No.45641501[source]
This.

Any table with complex parent/spanned-cell relationships still gets low accuracy.

Try the reverse: take a picture of a complex table and ask ChatGPT 5, Claude Opus 3.1, or Gemini Pro 2.5 to produce an HTML table.

They will fail.
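
To make "fail" checkable, one option is to expand `rowspan`/`colspan` into a flat grid and compare the model's HTML against a hand-made reference. The sketch below is an assumption about how one might score this, not a method anyone in the thread described; `TableGrid`, `to_grid`, and the sample tables are hypothetical, and it handles only a single, non-nested, non-empty table.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Expand rowspan/colspan of one HTML table into (row, col) -> text."""

    def __init__(self):
        super().__init__()
        self.grid = {}
        self.row = -1
        self.col = 0
        self.cell = None  # (rowspan, colspan, text parts) while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            a = dict(attrs)
            self.cell = (int(a.get("rowspan") or 1), int(a.get("colspan") or 1), [])

    def handle_data(self, data):
        if self.cell is not None:
            self.cell[2].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            rowspan, colspan, parts = self.cell
            text = "".join(parts).strip()
            # Skip slots already claimed by a rowspan from an earlier row.
            while (self.row, self.col) in self.grid:
                self.col += 1
            for r in range(self.row, self.row + rowspan):
                for c in range(self.col, self.col + colspan):
                    self.grid[(r, c)] = text
            self.col += colspan
            self.cell = None


def to_grid(html: str) -> list[list[str]]:
    """Return the table as a rectangular list of rows, spans expanded."""
    parser = TableGrid()
    parser.feed(html)
    rows = 1 + max(r for r, _ in parser.grid)
    cols = 1 + max(c for _, c in parser.grid)
    return [[parser.grid.get((r, c), "") for c in range(cols)] for r in range(rows)]


reference = """<table>
<tr><th rowspan="2">Name</th><th colspan="2">Compensation</th></tr>
<tr><th>Salary</th><th>Bonus</th></tr>
<tr><td>A</td><td>1</td><td>2</td></tr>
</table>"""

# Same table, but the rowspan on "Name" was dropped.
model_output = reference.replace(' rowspan="2"', "")

print(to_grid(reference) == to_grid(model_output))
```

Dropping a single `rowspan` shifts every cell in the second header row one column to the left, which is the kind of mistake that is easy to miss by eye but shows up immediately in a grid comparison.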

replies(2): >>45641541 #>>45641916 #
pietz ◴[] No.45641916[source]
Maybe my imagination is limited or our documents aren't complex enough, but are we talking about realistic written documents? I'm sure you can take a screenshot of a very complex spreadsheet and it fails, but in that case you already have the data in structured form anyway, no?
replies(2): >>45642356 #>>45644170 #
kbumsik ◴[] No.45642356[source]
> realistic written documents?

Just get a DEF 14A (Annual meeting) filing of a company from SEC EDGAR.

I have seen so many mistakes when looking at the results closely.

Here is a DEF 14A filing from Salesforce. You can print it to a PDF and then try converting.

https://www.sec.gov/Archives/edgar/data/1108524/000110852425...

replies(1): >>45643178 #
grosswait ◴[] No.45643178[source]
Historical filings are still a problem, but hasn’t the SEC required filing in an XML format since the end of 2024?
replies(1): >>45643659 #
1. richardlblair ◴[] No.45643659[source]
It's not really about SEC filings, though. We folks on HN would never think of hard copies of invoices, but much of the world still operates this way.

As mentioned above, I have about 200 construction invoices. They are all formatted in a way that doesn't make sense. Most fail both OCR and OpenAI.

replies(1): >>45645579 #
2. KoolKat23 ◴[] No.45645579[source]
OpenAI has unusably low image DPI. Try Gemini.
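
A rough way to see why resolution matters: if a service scales an uploaded page so its longer side fits some pixel cap, the effective DPI follows directly from the page size. This is a back-of-the-envelope sketch; the cap value used below is a made-up placeholder, not a documented limit of any provider, and the function names are hypothetical.

```python
def pixels_at(page_inches: tuple[float, float], dpi: float) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    width_in, height_in = page_inches
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(page_inches: tuple[float, float], max_side_px: int) -> float:
    """DPI that remains once the longer side is scaled down to max_side_px.

    Assumes the original scan was at least that large; no upscaling is modeled.
    """
    return max_side_px / max(page_inches)


LETTER = (8.5, 11.0)          # US Letter, in inches
HYPOTHETICAL_CAP_PX = 1100    # placeholder cap, purely for illustration

print(pixels_at(LETTER, 300))
print(effective_dpi(LETTER, HYPOTHETICAL_CAP_PX))
```

Small print on invoices is often scanned at 300 DPI, so any cap that pushes the effective figure far below that is worth checking before blaming the model itself.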