DeepSeek OCR (github.com)
990 points by pierre | 12 comments
pietz No.45641449
    My impression is that OCR is basically solved at this point.

The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.

    I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content in Markdown under some additional formatting instructions. That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.
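
Roughly this, per page (a sketch with the google-genai Python SDK; the prompt wording and file name are just illustrative):

    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment

    with open("page_001.png", "rb") as f:  # one rendered PDF page
        page = f.read()

    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=[
            types.Part.from_bytes(data=page, mime_type="image/png"),
            "Extract the content of this page as Markdown. "
            "Preserve headings, lists, and tables.",
        ],
    )
    print(response.text)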

    I'd be interested to hear where OCR still struggles today.

1. raincole No.45641533
If you can accept that the machine just makes up what it doesn't recognize instead of saying "I don't know," then yes, it's solved.

    (I'm not being snarky. It's acceptable in some cases.)

2. jakewins No.45641608
But this was very much the case with existing OCR software as well? Though in fairness, LLMs will make up plausible-looking text instead of text riddled with errors, which makes the mistakes much harder to catch.
3. red75prime No.45642140
    Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).
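The instruction I gave it was along these lines (wording illustrative, not a magic phrase):

    Transcribe this page exactly. If you are not confident
    about a word, wrap it in [?...?] rather than guessing.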
4. rkagerer No.45642440
    Good libraries gave results with embedded confidence levels for each unit recognized.
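Tesseract still does this, e.g. through pytesseract (a sketch; the cutoff of 60 is arbitrary):

    import pytesseract
    from PIL import Image

    # image_to_data returns per-word results, including a
    # confidence value (0-100; -1 marks non-text regions)
    data = pytesseract.image_to_data(
        Image.open("page.png"), output_type=pytesseract.Output.DICT
    )
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip() and int(conf) < 60:
            print(f"low confidence ({conf}): {word}")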
5. wahnfrieden No.45643820
Existing OCR doesn't skip over entire (legible) paragraphs or hallucinate entire sentences.
6. wahnfrieden No.45643829
    Do any LLM OCRs give bounding boxes anyway? Per character and per block.
7. Davidzheng No.45643920{3}
Rarely happens to me using LLMs to transcribe PDFs.
8. criddell No.45644305{3}
I usually run the image(s) through more than one converter, then compare the results. They all have problems, but the parts they agree on are usually correct.
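
A cheap way to automate that comparison (a sketch; the file names and 20-character threshold are made up):

    import difflib

    def agreed_spans(a: str, b: str, min_len: int = 20):
        """Substrings where two OCR outputs agree verbatim."""
        matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
        return [
            a[m.a : m.a + m.size]
            for m in matcher.get_matching_blocks()
            if m.size >= min_len
        ]

    tesseract_text = open("out_tesseract.txt").read()
    gemini_text = open("out_gemini.txt").read()
    trusted = agreed_spans(tesseract_text, gemini_text)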
9. KoolKat23 No.45645028
These days it does just that: it'll say null (or whatever you specify) if you give it the option. When it does make something up, it tends to be a limitation of the image quality (max DPI).
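
You can make that explicit with a nullable response schema; a sketch with the google-genai Python SDK (the Line model, prompt, and file name are mine):

    from pydantic import BaseModel
    from google import genai
    from google.genai import types

    class Line(BaseModel):
        text: str | None  # null when illegible, instead of a guess

    client = genai.Client()
    with open("page.png", "rb") as f:
        page = f.read()

    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=page, mime_type="image/png"),
            "Transcribe each line. Use null for illegible lines.",
        ],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=list[Line],
        ),
    )
    print(resp.text)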

Blotchy text and certain typefaces can make a 6 look like an 8. At a glance even a human would read it as an 8; zoom in and you see it's a 6.

Google's image quality on uploads is still streets ahead of OpenAI's, btw.

10. KoolKat23 No.45645395{3}
    This must be some older/smaller model.
11. kelvinjps10 No.45647263
Gemini does, but it's not as good as Google Vision, and the format is different. Here's the documentation: https://cloud.google.com/vertex-ai/generative-ai/docs/boundi...

Also, Simon Willison made a blog post that might be helpful: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
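
Per those docs, Gemini returns boxes as [ymin, xmin, ymax, xmax] normalized to 0-1000. A sketch (prompt wording and file name are mine):

    import json
    from google import genai
    from google.genai import types

    client = genai.Client()
    with open("page.png", "rb") as f:
        page = f.read()

    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=page, mime_type="image/png"),
            "Detect every block of text. Return a JSON list of "
            '{"box_2d": [ymin, xmin, ymax, xmax], "text": "..."} '
            "with coordinates normalized to 0-1000.",
        ],
        config=types.GenerateContentConfig(
            response_mime_type="application/json"
        ),
    )
    blocks = json.loads(resp.text)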

I hope that this capability improves so I can use only the Gemini API.

12. dajonker No.45674352
Try MinerU 2.5 with two-step parsing. It gives good results with bounding boxes per block. Not sure if you can get anything more fine-grained, such as word or character level.