DeepSeek OCR

(github.com)
990 points by pierre | 63 comments
1. pietz ◴[] No.45641449[source]
My impression is that OCR is basically solved at this point.

The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.

I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content in Markdown under some additional formatting instructions. That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.
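
For reference, the per-page call looks roughly like this; a minimal sketch with the google-genai Python SDK, where the prompt wording and the synchronous loop are illustrative rather than my exact pipeline (the ~$0.20 per 1000 pages figure assumes the separate batch submission API):

    # Minimal per-page sketch (google-genai SDK, `pip install google-genai`).
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment

    PROMPT = (
        "Extract all text on this page as Markdown. Preserve headings, "
        "lists and tables. Do not add commentary."
    )

    def ocr_page(png_bytes: bytes) -> str:
        response = client.models.generate_content(
            model="gemini-2.5-flash-lite",
            contents=[
                types.Part.from_bytes(data=png_bytes, mime_type="image/png"),
                PROMPT,
            ],
        )
        return response.text

    with open("page_001.png", "rb") as f:
        print(ocr_page(f.read()))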

I'd be interested to hear where OCR still struggles today.

replies(23): >>45641470 #>>45641479 #>>45641533 #>>45641536 #>>45641612 #>>45641806 #>>45641890 #>>45641904 #>>45642270 #>>45642699 #>>45642756 #>>45643016 #>>45643911 #>>45643964 #>>45644404 #>>45644848 #>>45645032 #>>45645325 #>>45646756 #>>45647189 #>>45647776 #>>45650079 #>>45651460 #
2. kbumsik ◴[] No.45641470[source]
> My impression is that OCR is basically solved at this point.

Not really, in my experience. They still struggle with table format detection in particular.

replies(2): >>45641501 #>>45643548 #
3. carschno ◴[] No.45641479[source]
Technically not OCR, but HTR (hand-written text/transcript recognition) is still difficult. LLMs have increased accuracy, but their mistakes are very hard to identify because they just 'hallucinate' text they cannot digitize.
replies(3): >>45641563 #>>45641605 #>>45641795 #
4. coulix ◴[] No.45641501[source]
This.

Any table with complex parent/child or spanned-cell relationships still gets low accuracy.

Try the reverse: take a picture of a complex table and ask ChatGPT 5, Claude Opus, or Gemini 2.5 Pro to produce an HTML table.

They will fail.

replies(2): >>45641541 #>>45641916 #
5. raincole ◴[] No.45641533[source]
If you can accept that the machine just makes up what it doesn't recognize instead of saying "I don't know," then yes, it's solved.

(I'm not being snarky. It's acceptable in some cases.)

replies(4): >>45641608 #>>45642140 #>>45643829 #>>45645028 #
6. peter-m80 ◴[] No.45641536[source]
No way it's solved. Try running OCR over a magazine with creative layouts. Not possible. I have a collection of vintage computer magazines, and from time to time I try to OCR them with state-of-the-art tools. All of them require a lot of human intervention.
replies(3): >>45641544 #>>45641838 #>>45644342 #
7. bobsmooth ◴[] No.45641541{3}[source]
Maybe I misunderstood the assignment but it seems to work for me.

https://chatgpt.com/share/68f5f9ba-d448-8005-86d2-c3fbae028b...

Edit: Just caught a mistake, transcribed one of the prices incorrectly.

replies(1): >>45641692 #
8. jmkni ◴[] No.45641544[source]
Do you have an example of a particularly tricky one?
replies(1): >>45641617 #
9. sramam ◴[] No.45641563[source]
Interesting - have you tried sending the image and 'hallucinated' text together to a review LLM to fix mistakes?

I don't have a use case where hundreds or thousands of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots, and it has worked really well.

replies(1): >>45642404 #
10. mormegil ◴[] No.45641605[source]
This. I am reading old vital records in my family genealogy quest, and since those are sometimes really difficult to read, I turned to LLMs, having heard they are great at OCR. It's been… terrible. The LLM will transcribe the record without problems and the output seems completely correct, a typical text of a vital record. Just… the transcribed text has nothing to do with my specific record. On the other hand, transkribus.eu has been fairly usable for old vital record transcription: even though the transcribed text is far from perfect (many letters and words are recognized incorrectly), it helps me a lot with the more difficult records.
11. jakewins ◴[] No.45641608[source]
But this was very much the case with existing OCR software as well? In fairness, I guess the LLMs will end up making up plausible-looking text instead of text riddled with errors, which makes the mistakes much harder to catch.
replies(2): >>45642440 #>>45643820 #
12. darkwater ◴[] No.45641612[source]
So, the mug with inspirational text says "Bountiful Potential"?
13. ekianjo ◴[] No.45641617{3}[source]
Just try old ads and you will see how hard it gets.
14. kbumsik ◴[] No.45641692{4}[source]
Right, I wouldn't hand full table detection to a VLM, because they tend to make mistakes with the numbers in tables...
15. pietz ◴[] No.45641795[source]
We ran a small experiment internally on this and it looked like Gemini is better at handwriting recognition than I am. After seeing what it parsed, I was like "oh yeah, that's right". I do agree that instead of saying "Sorry, I can't read that" it just made up something.
replies(1): >>45642703 #
16. cahaya ◴[] No.45641806[source]
Lots of OCR/LLM systems (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML: tables with multiple headers and merged cells get mixed up, multiple columns with tick boxes get mixed up, and multi-page tables are not understood correctly. LlamaIndex also fails miserably on these.

Curious to hear which OCR/LLM excels at these specific issues? Example complex table: https://cdn.aviation.bot/complex-tables.zip

I can only parse this table correctly by first writing the table headers manually into HTML as example output. However, it still mixes up the tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist

replies(3): >>45641970 #>>45642001 #>>45645000 #
17. pietz ◴[] No.45641838[source]
Could you provide an example that fails? I'm interested in this.
18. vintermann ◴[] No.45641890[source]
OCR of printed text may be one thing, but handwriting OCR (a.k.a. HTR) is very, very far from solved. It's actually hard to find a practical task that general historical HTR is good enough to do usefully, even for state-of-the-art models.
19. robotswantdata ◴[] No.45641904[source]
VLMs suck at complex layouts, and there is a high risk of hallucination. Never use them alone for contracts or health data.
20. pietz ◴[] No.45641916{3}[source]
Maybe my imagination is limited or our documents aren't complex enough, but are we talking about realistic written documents? I'm sure you can take a screenshot of a very complex spreadsheet and it fails, but in that case you already have the data in structured form anyway, no?
replies(2): >>45642356 #>>45644170 #
21. pietz ◴[] No.45641970[source]
I threw the first image/table into Gemini 2.5 Pro, letting it choose the output format, and it looks like it extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I didn't specify preferences.
22. CaptainOfCoit ◴[] No.45642001[source]
> Lot's of OCR/ LLM's (Even Gemini Pro 2.5) still struggle converting complex tables to markdown or HTML:

But that's something else; that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "can take letters in images and make them into digital text" to "can replicate anything seen on a screen", the problem space gets too big.

For those images you have, I'd use something like Magistral + Structured Outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data (rough sketch below).
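
Something like the following, roughly; I'm using the OpenAI Python client here as a generic stand-in for whatever vision model you point it at, so the model name, prompts, and JSON mode are placeholders rather than a tested pipeline:

    # Two-pass sketch: pass 1 infers the table's structure, pass 2 extracts
    # the data constrained by that structure. Model and prompts are placeholders.
    import base64, json
    from openai import OpenAI

    client = OpenAI()  # point base_url / api_key at your provider of choice

    def image_part(path: str) -> dict:
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        return {"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"}}

    def two_pass_extract(path: str, model: str = "gpt-4.1-mini") -> dict:
        # Pass 1: describe the structure only (columns, header levels, merges).
        schema = client.chat.completions.create(
            model=model,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": [
                image_part(path),
                {"type": "text", "text":
                 "Describe this table as JSON: column names, header levels, "
                 "and any merged cells. Return JSON only."}]}],
        ).choices[0].message.content

        # Pass 2: extract every row, constrained by the structure from pass 1.
        data = client.chat.completions.create(
            model=model,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": [
                image_part(path),
                {"type": "text", "text":
                 "Extract all rows of this table as JSON matching this "
                 "structure:\n" + schema}]}],
        ).choices[0].message.content
        return json.loads(data)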

replies(3): >>45642072 #>>45643869 #>>45643966 #
23. eeixlk ◴[] No.45642072{3}[source]
htceaad t nofdnsy lyruuieo sieerrr t owcope?
24. red75prime ◴[] No.45642140[source]
Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).
25. sbinnee ◴[] No.45642270[source]
Maybe for English. Other languages are very much not solved.
26. kbumsik ◴[] No.45642356{4}[source]
> realistic written documents?

Just get a DEF 14A (Annual meeting) filing of a company from SEC EDGAR.

I have seen so many mistakes when looking at the result closely.

Here is a DEF 14A filing from Salesforce. You can print it to a PDF and then try converting it.

https://www.sec.gov/Archives/edgar/data/1108524/000110852425...

replies(1): >>45643178 #
27. lazide ◴[] No.45642404{3}[source]
Often, the review LLM will also say everything is okay when it’s made up too.
28. rkagerer ◴[] No.45642440{3}[source]
Good libraries give results with embedded confidence levels for each recognized unit.
29. llm_nerd ◴[] No.45642699[source]
Complex documents are where OCR struggles mightily. If you have a simple document with paragraphs of text, sure, OCR is pretty much solved. If you have a complex layout with figures, graphs, supporting images, asides, captions, and so on (basically any paper, or even trade documents), it absolutely falls apart.

And general-purpose LLMs are heinous at OCR. If you are having success with Flash Lite, your documents must be incredibly simple.

There have been enormous advances in OCR over the past 6 months, so the state of the art is a moving, rapidly advancing target.

30. CraigRood ◴[] No.45642703{3}[source]
My feeling is that whilst LLM providers could say "Sorry, I can't read that", there is little incentive to, and doing so would expose the reality that they are not very accurate and can't be properly measured. That said, there clearly are use cases where, if the LLM can't reach a certain level of confidence, it should defer to the user rather than guess.
replies(1): >>45649569 #
31. burpsnard ◴[] No.45642756[source]
I've only used Tesseract, 'recreationally', but I tried generating images of random chars to see what resolution/contrast/noise was minimally recognisable, and I was shocked at how bad it was. It relies heavily on language models of character sequences, so it's pretty useless on 'line noise'.
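
The setup was roughly the following (Pillow + pytesseract; the default bitmap font, canvas size, and salt-and-pepper noise here are arbitrary stand-ins for what I actually ran):

    # Render random characters, add noise, and see what Tesseract recovers.
    # Assumes Pillow and pytesseract are installed and tesseract is on PATH.
    import random, string
    from PIL import Image, ImageDraw
    import pytesseract

    def render(text: str, noise: float) -> Image.Image:
        img = Image.new("L", (20 * len(text), 40), color=255)
        ImageDraw.Draw(img).text((5, 10), text, fill=0)  # default bitmap font
        px = img.load()
        for x in range(img.width):          # salt-and-pepper noise
            for y in range(img.height):
                if random.random() < noise:
                    px[x, y] = random.choice((0, 255))
        return img

    text = "".join(random.choices(string.ascii_uppercase + string.digits, k=10))
    for noise in (0.0, 0.02, 0.05, 0.10):
        got = pytesseract.image_to_string(render(text, noise)).strip()
        print(f"noise={noise:.2f}  expected={text}  got={got!r}")
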
32. baobun ◴[] No.45643016[source]
Chinese, especially handwritten.
33. grosswait ◴[] No.45643178{5}[source]
Historical filings are still a problem, but hasn’t the SEC required filing in an XML format since the end of 2024?
replies(1): >>45643659 #
34. richardlblair ◴[] No.45643548[source]
I mentioned this when the new Qwen model dropped: I have a stack of construction invoices that fail with both traditional OCR and OpenAI.

It's a hard (and very interesting) problem space.

35. richardlblair ◴[] No.45643659{6}[source]
It's not really about SEC filings, though. While we folks on HN would never think of hard copies of invoices, much of the world still operates that way.

As mentioned above, I have about 200 construction invoices. They are all formatted in a way that doesn't make sense, and most fail with both traditional OCR and OpenAI.

replies(1): >>45645579 #
36. wahnfrieden ◴[] No.45643820{3}[source]
Existing OCR doesn't skip over entire (legible) paragraphs or hallucinate entire sentences.
replies(3): >>45643920 #>>45644305 #>>45645395 #
37. wahnfrieden ◴[] No.45643829[source]
Do any LLM OCRs give bounding boxes anyway? Per character and per block.
replies(2): >>45647263 #>>45674352 #
38. kmacdough ◴[] No.45643869{3}[source]
> But that's something else, that's no longer just OCR ("Optical Character Recognition").

Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.

It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents. But saying it's "not OCR" doesn't seem meaningful from a technical perspective.

39. Davidzheng ◴[] No.45643911[source]
I think it would be good to have an end-to-end PDF-to-LaTeX converter for old math papers. Almost all models still struggle with commutative diagrams, especially very complicated ones.
40. Davidzheng ◴[] No.45643920{4}[source]
That rarely happens to me when using LLMs to transcribe PDFs.
41. simlevesque ◴[] No.45643964[source]
> That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.

Can you explain more about your setup? I have a quarter million pages I want to OCR.

42. kmacdough ◴[] No.45643966{3}[source]
> But that's something else, that's no longer just OCR ("Optical Character Recognition").

Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.

It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents where wider context may be important. But saying it's "not OCR" doesn't seem meaningful from a technical perspective. It's an extension of the same goal to convert images of documents into the most accurate and useful digitized form with the least manual intervention.

replies(1): >>45644127 #
43. CaptainOfCoit ◴[] No.45644127{4}[source]
Personally I think there's a meaningful distinction between "can extract text" vs. "can extract text and structure". It's true that some OCR systems try to replicate the structure, but even today I think that's the exception, not the norm.

Not to mention it's helpful to separate the two because there is such a big difference in the difficulty of the tasks.

44. daemonologist ◴[] No.45644170{4}[source]
Now if someone mails or faxes you that spreadsheet? You're screwed.

Spreadsheets are not the biggest problem though, as they have a reliable 2-dimensional grid - at worst some cells will be combined. The form layouts and n-dimensional table structures you can find on medical and insurance documents are truly unhinged. I've seen documents that I struggled to interpret.

replies(1): >>45645458 #
45. criddell ◴[] No.45644305{4}[source]
I usually run the image(s) through more than one converter and then compare the results. They all have problems, but the parts they agree on are usually correct.
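
The comparison step can be as simple as the sketch below; difflib is just one way to find the spans two engines agree on, and the file names are placeholders:

    # Given two transcriptions of the same page from different converters,
    # keep the spans where they agree; everything else gets flagged for review.
    import difflib

    def agreed_spans(text_a: str, text_b: str, min_len: int = 20) -> list[str]:
        matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
        return [text_a[b.a:b.a + b.size]
                for b in matcher.get_matching_blocks() if b.size >= min_len]

    ocr_a = open("page_engine_a.txt").read()
    ocr_b = open("page_engine_b.txt").read()
    agreed = agreed_spans(ocr_a, ocr_b)
    coverage = sum(len(s) for s in agreed) / max(len(ocr_a), 1)
    print(f"{coverage:.0%} of engine A's output is confirmed by engine B")
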
46. constantinum ◴[] No.45644342[source]
I use LLMWhisperer[1] for OCR'ing old magazine ads. It preserves the layout and context. Example: https://postimg.cc/ts3vT7kG

[1] https://pg.llmwhisperer.unstract.com/

47. constantinum ◴[] No.45644404[source]
Why PDF parsing is Hell[1]:

Fixed layout and lack of semantic structure in PDFs.

Non-linear text flow due to columns, sidebars, or images.

Position-based text without contextual or relational markers.

Absence of standard structure tags (like in HTML).

Scanned or image-based PDFs requiring OCR.

Preprocessing needs for scanned PDFs (noise, rotation, skew).

Extracting tables from unstructured or visually complex layouts.

Multi-column and fancy layouts breaking semantic text order.

Background images and watermarks interfering with text extraction.

Handwritten text recognition challenges.

[1] https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

48. KoolKat23 ◴[] No.45644848[source]
I agree, Gemini 2.5 models are excellent.

The fuss around old-fashioned OCR initially seemed strange to me given the above, but I selfishly forgot to consider compute/offline requirements.

It would also be nice for there to be a good competitor.

49. ◴[] No.45645000[source]
50. KoolKat23 ◴[] No.45645028[source]
These days it does just that: it'll say null or whatever if you give it the option. When it does make something up, it tends to be a limitation of the image quality (max DPI).

Blotchy text and certain typefaces make 6s look like 8s; at a glance even a human would read it as an 8, and only see it's a 6 after zooming in.

Google's image quality on uploads is still streets ahead of OpenAI's, for instance, by the way.

51. cle ◴[] No.45645032[source]
That will not work with many of the world's most important documents because of information density: for example, dense tables or tables with lots of row/column spans, complex forms with checkboxes, and complex real-world formatting features like strikethroughs.

To solve this generally you need to chunk not by page, but by semantic chunks that don't exceed the information density threshold of the model, given the task.

This is not a trivial problem at all. And sometimes there is no naive way to chunk documents so that every element fits within the information-density limit; a really simple example is a table that spans hundreds of pages. Solving that generally is an open problem.

52. Gazoche ◴[] No.45645325[source]
There is no "solved" in computer vision; there is only "good enough", and what constitutes "good enough" depends on your problem domain.

Take an OCR model with 99.9% character-wise accuracy. Sounds pretty good, right? Well, if your use case is, say, digitizing old printed novels, then yeah, it's probably good enough.

But what if your documents are personal records with millions of names, to insert into some administrative database? At around ten characters per name, roughly 1 in 100 people will now have their name misspelled. Oops.

53. KoolKat23 ◴[] No.45645395{4}[source]
This must be some older/smaller model.
54. KoolKat23 ◴[] No.45645458{5}[source]
To be fair, this is problematic for humans too. My old insurer outright rejected things like that, stating they weren't legible.

(I imagine it also had the benefit of reducing fraud/errors).

In this day and age, it's probably easier/better to change the process around it, as there's little excuse for such shit-quality input. I understand this isn't always possible, though.

55. KoolKat23 ◴[] No.45645579{7}[source]
OpenAI has unusably low image DPI. Try Gemini.
56. 6gvONxR4sf7o ◴[] No.45646756[source]
OCR for printed documents is super robust, but handwriting, low-resolution input, and aligned recognition (not just image to "hello world", but also knowing that h is here in the image, e is here, and so on) are all still well behind "basically solved."
57. kelvinjps10 ◴[] No.45647189[source]
Google Vision is still better than Gemini at OCR, for example at getting bounding boxes.
58. kelvinjps10 ◴[] No.45647263{3}[source]
Gemini does, but it's not as good as Google Vision, and the format is different. Here is the documentation: https://cloud.google.com/vertex-ai/generative-ai/docs/boundi...

Also, Simon Willison made a blog post that might be helpful: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...

I hope this capability improves so I can use only the Gemini API.
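
For reference, the prompt-based approach from those links looks roughly like this (google-genai SDK; the [ymin, xmin, ymax, xmax] convention normalized to 0-1000 is from the linked docs, and the JSON shape is simply whatever you ask for):

    # Sketch: ask Gemini for text-block bounding boxes as JSON.
    from google import genai
    from google.genai import types

    client = genai.Client()

    def detect_text_blocks(png_bytes: bytes) -> str:
        prompt = ("Detect every block of text in this image. Return a JSON "
                  "list of objects with 'label' and 'box_2d' given as "
                  "[ymin, xmin, ymax, xmax], normalized to 0-1000.")
        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[types.Part.from_bytes(data=png_bytes,
                                            mime_type="image/png"),
                      prompt],
        )
        return response.text  # parse and rescale to pixel coordinates downstream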

59. blindriver ◴[] No.45647776[source]
I attempted OCR using all of the open-source models available about 3 months ago, including Llama 4. These were PNGs of text using a regular font. Most produced garbage except Llama 4, and even then it was only about 90% accurate. Using OpenAI or Gemini produced much better results, but the open-source models were really bad.
60. Rudybega ◴[] No.45649569{4}[source]
This is actively being worked on by pretty much every major provider. It was the subject of that recent OpenAI paper on hallucinations. The problem is mostly caused by benchmarks that reward correct answers but don't penalize bad answers more than simply not answering.

E.g.

Most current benchmarks have a scoring scheme of

  1    - correct answer
  0    - no answer or incorrect answer

But what they need is something more like

  1    - correct answer
  0.25 - no answer
  0    - incorrect answer

You need benchmarks (particularly those used in training) to incentivize the models to acknowledge when they're uncertain.
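
As a toy illustration of the two schemes (the numbers are just the ones above, not from any real benchmark):

    # Strict scheme: a wrong guess and an abstention score the same,
    # so the model is always incentivized to guess.
    def score_strict(answer: str | None, correct: str) -> float:
        return 1.0 if answer == correct else 0.0

    # Abstention-aware scheme: saying "I don't know" (answer=None) beats
    # guessing wrong, so calibrated uncertainty is rewarded.
    def score_with_abstention(answer: str | None, correct: str) -> float:
        if answer is None:
            return 0.25
        return 1.0 if answer == correct else 0.0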

61. themanmaran ◴[] No.45650079[source]
> OmniAI benchmark that's also referenced here wasn't updated with new models since February 2025. I assume that's because general purpose LLMs have gotten better at OCR than their own OCR product.

Benchmark author here. No, we just pivoted away from the OCR API as a product! We still use our API internally but have been lazy about updating the benchmarks.

Gemini is definitely the best model for OCR. But it has a really high rate of "recitation" errors, where it determines the output is too close to its training data and cuts it off: something like 10% of the time in our testing. It also has a hilarious hallucination mode where, if there's a blank page in the document mix, it just makes up new info.

OpenAI is OK. GPT-5 wasn't any better than 4o or 4.1. The main issues were: dropping content like headers/footers, losing its mind on sideways pages, and frequently refusing to read things like ID documents, health care forms, or anything it judges to have too much PII.

62. veidr ◴[] No.45651460[source]
Clearly-printed text to a sequence of characters is solved, for use cases that don't require 100% accuracy.

But not for semantic document structure — recognizing that the grammatically incomplete phrase in a larger font is a heading, recognizing subheadings and bullet lists, tables, etc.

Also not for handwritten text, text inside of images (signage and so forth), or damaged source material (old photocopies and scans created in the old days).

Those all seem to me like areas where an LLM-based approach could narrow the gap between machine recognition and humans. As a human, you also have to reason about it from the context to figure it out.

63. dajonker ◴[] No.45674352{3}[source]
Try MinerU 2.5 with two-step parsing. It gives good results with bounding boxes per block. Not sure if you can get it to go more detailed, such as word or character level.