DeepSeek OCR (github.com)
990 points by pierre | 6 comments
breadislove No.45643006
For everyone wondering how good this and other benchmarks are:

- the OmniAI benchmark is bad

- Check out OmniDocBench[1] instead

- Mistral OCR is far behind most open-source OCR models, and even further behind Gemini

- End to End OCR is still extremely tricky

- composed pipelines work better (layout detection -> reading order -> OCR on every element); a minimal sketch of this shape follows the list

- complex table parsing is still extremely difficult

[1]: https://github.com/opendatalab/OmniDocBench
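
(Not from the thread, but to make the pipeline bullet concrete: a minimal Swift sketch of the composed-pipeline shape, layout detection -> reading order -> per-element OCR. Every protocol and type name here is illustrative, not a real library.)

    import CoreGraphics

    // Illustrative stage protocols; real implementations would wrap actual models.
    struct Region { let bounds: CGRect }

    protocol LayoutDetector { func detect(_ page: CGImage) -> [Region] }
    protocol ReadingOrderer { func order(_ regions: [Region]) -> [Region] }
    protocol ElementOCR     { func recognize(_ page: CGImage, in region: Region) -> String }

    // Compose the stages: detect layout, sort into reading order, OCR each element.
    func runPipeline(page: CGImage, layout: LayoutDetector,
                     orderer: ReadingOrderer, ocr: ElementOCR) -> String {
        let ordered = orderer.order(layout.detect(page))
        return ordered.map { ocr.recognize(page, in: $0) }.joined(separator: "\n")
    }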

replies(2): >>45643626 #>>45647948 #
hakunin No.45643626
Wish someone benchmarked the Apple Vision framework against these others. It's built into most Apple devices, but people don't know you can actually harness it to do fast, good-quality OCR for you (and go a few extra steps to produce searchable PDFs, which is my typical use case). I'm very curious where it would fall in the benchmarks.
replies(3): >>45643785 #>>45643798 #>>45645485 #
wahnfrieden No.45643785
It is unusable trash for languages with any vertical writing such as Japanese. It simply doesn’t work.
replies(1): >>45644032 #
1. thekid314 No.45644032
Yeah, and fails quickly at anything handwritten.
replies(2): >>45644877 #>>45648073 #
2. hakunin No.45644877
I mostly OCR English, so Japanese (as mentioned by the parent) wouldn't be an issue for me, but I do care about handwriting. See, these insights are super helpful. If only there were, say, a benchmark to show these.

My main question really is: what are practical OCR tools that I can string together on my MacBook Pro M1 Max w/ 64GB RAM to maximize OCR quality for all the mail and schoolwork coming into my house, mostly in English?

I use ScanSnap Manager with its built-in OCR tools, but that's probably super outdated by now. Apple Vision does a way better job than that. I've heard people say that Apple Vision is better than Tesseract, too. But is there something better still that's also practical to run in a scripted environment on my machine?
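
(Not from the thread, but as a concrete starting point: a minimal sketch of scripting the Vision framework's text recognizer on macOS. The file path is a placeholder and error handling is kept minimal.)

    import Foundation
    import Vision

    // OCR a single image file with Vision's VNRecognizeTextRequest.
    let url = URL(fileURLWithPath: "/path/to/scan.png")
    let request = VNRecognizeTextRequest { request, _ in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        for observation in observations {
            // topCandidates(1) yields the most confident transcription per line.
            if let best = observation.topCandidates(1).first {
                print(best.string)
            }
        }
    }
    request.recognitionLevel = .accurate      // .fast trades accuracy for speed
    request.recognitionLanguages = ["en-US"]  // English-only, per the use case above
    request.usesLanguageCorrection = true

    // perform(_:) is synchronous; the completion handler runs before it returns.
    try? VNImageRequestHandler(url: url).perform([request])

Saved as a single file, this runs as a plain script (swift ocr.swift); from there it's easy to loop over a scan directory and feed the output into a searchable-PDF step.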

3. wahnfrieden No.45648073
LiveText too? It has a newer engine
replies(1): >>45648263 #
4. hakunin No.45648263
This is the second comment of yours about LiveText (here's the older one: https://news.ycombinator.com/item?id=43192141), which I found by complete coincidence because I'm trying to provide a Ruby API for these frameworks. However, I can't find much info on LiveText. What framework is it part of? Do you have any links or additional info? I found a source saying it's specifically for screen and camera capture.
replies(1): >>45648311 #
5. wahnfrieden No.45648311
https://developer.apple.com/documentation/visionkit/imageana... It's part of VisionKit. It's Swift-only (as with many new APIs), so lots of people stuck on ObjC bridges simply ignore it.

It does not provide bounding boxes, but you can get text.
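
(Not from the thread: a minimal sketch of that VisionKit path, assuming macOS 13+ and the URL-based analyze overload. The file path is a placeholder; as noted above, you get a transcript, not boxes.)

    import Foundation
    import ImageIO
    import VisionKit

    // ImageAnalyzer is async and Swift-only; run this as top-level script code.
    let url = URL(fileURLWithPath: "/path/to/scan.png")
    let analyzer = ImageAnalyzer()
    let configuration = ImageAnalyzer.Configuration([.text])
    let analysis = try await analyzer.analyze(imageAt: url,
                                              orientation: .up,
                                              configuration: configuration)
    // A plain transcript of the recognized text; no bounding boxes.
    print(analysis.transcript)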

replies(1): >>45648652 #
6. hakunin No.45648652
That's great, I'm going to give this a shot. If you have any more resources, please do share. I don't mind Swift-only, because I'm writing little shims with `@_cdecl` for the bridge (I don't have much experience here, but I'm hoping this will work, leaning on AI for support).
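
(Not from the thread: a hypothetical shape for such a shim. The function name ocr_file and the overall design are made up for illustration, and it wraps Vision rather than VisionKit, since the async ImageAnalyzer API is harder to expose over a C boundary.)

    import Foundation
    import Vision

    // Hypothetical C-callable entry point for a Ruby FFI bridge.
    @_cdecl("ocr_file")
    public func ocr_file(_ path: UnsafePointer<CChar>) -> UnsafeMutablePointer<CChar>? {
        let url = URL(fileURLWithPath: String(cString: path))
        var lines: [String] = []
        let request = VNRecognizeTextRequest { request, _ in
            for obs in (request.results as? [VNRecognizedTextObservation]) ?? [] {
                if let best = obs.topCandidates(1).first { lines.append(best.string) }
            }
        }
        request.recognitionLevel = .accurate
        try? VNImageRequestHandler(url: url).perform([request])
        // strdup so the string outlives this call; the caller must free() it.
        return strdup(lines.joined(separator: "\n"))
    }

Built with swiftc -emit-library, Ruby can then load the resulting dylib through Fiddle and call ocr_file like any C function.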