    DeepSeek OCR

    (github.com)
    990 points by pierre | 13 comments
    1. yoran ◴[] No.45640836[source]
    How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?
    replies(7): >>45640943 #>>45640992 #>>45642214 #>>45643557 #>>45644126 #>>45647313 #>>45667751 #
    2. sandblast ◴[] No.45640943[source]
    Not sure why you're being downvoted, I'm also curious.
    3. ozgune ◴[] No.45640992[source]
    OmniAI has a benchmark that compares LLMs to cloud OCR services.

    https://getomni.ai/blog/ocr-benchmark (Feb 2025)

    Please note that LLMs have progressed at a rapid pace since Feb. We see much better results with the Qwen3-VL family, particularly Qwen3-VL-235B-A22B-Instruct for our use case.

    replies(2): >>45642739 #>>45647914 #
    4. numpad0 ◴[] No.45642214[source]
    Classical OCR still probably makes undesirable su6stıtutìons in CJK, since there are far too many similar characters, even some absurd ones that are only distinguishable under a microscope or by comparing binary representations. LLMs are better constrained to valid sequences of characters, so they should be more accurate.

    Or at least that kind of thing would motivate them to re-implement OCR with an LLM.

    replies(1): >>45644008 #
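The "only distinguishable by looking at binary representations" point is easy to demonstrate. The pairs below are hand-picked illustrations, not taken from any real OCR engine's confusion data:

```python
import unicodedata

# Hypothetical examples: glyphs that look nearly identical in print but
# are distinct code points, so a classical OCR substitution can be
# invisible on screen yet wrong at the binary level.
confusable_pairs = [
    ("i", "ı"),    # LATIN SMALL LETTER I vs LATIN SMALL LETTER DOTLESS I
    ("力", "カ"),  # CJK ideograph "power" vs KATAKANA LETTER KA
    ("口", "ロ"),  # CJK ideograph "mouth" vs KATAKANA LETTER RO
]

for a, b in confusable_pairs:
    print(f"U+{ord(a):04X} {unicodedata.name(a)}  vs  "
          f"U+{ord(b):04X} {unicodedata.name(b)}")
    assert a != b  # same-looking glyphs, different binary representations
```

A language model scoring whole sequences would strongly prefer the katakana character inside a Japanese word and the ideograph inside a Chinese one, which is the constraint the comment describes.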
    5. CaptainOfCoit ◴[] No.45642739[source]
    Magistral-Small-2509 is pretty neat as well for its size; it has reasoning + multimodality, which helps in cases where the context isn't immediately clear or there are a few missing spots.
    6. make3 ◴[] No.45643557[source]
    Aren't all of these multimodal LLM approaches, just open vs. closed ones?
    7. fluoridation ◴[] No.45644008[source]
    Huh... Would it work to have some kind of error checking model that corrected common OCR errors? That seems like it should be relatively easy.
    replies(1): >>45646514 #
    8. daemonologist ◴[] No.45644126[source]
    My base expectation is that the proprietary OCR models will continue to win on real-world documents, and my guess is that this is because they have access to a lot of good private training data. These public models are trained on arxiv and e-books and stuff, which doesn't necessarily translate to typical business documents.

    As mentioned though, the LLMs are usually better at avoiding character substitutions, but worse at consistency across the entire page. (Just like a non-OCR LLM, they can and will go completely off the rails.)

    9. colonCapitalDee ◴[] No.45646514{3}[source]
    It's harder than it first seems. The root problem is that for text like "hallo", correcting to "hello" may be fixing an error or introducing one. In general, the more aggressive your error correction, the more errors you inadvertently introduce. You can try to make a judgment based on context ("hallo, how are you?"), which certainly helps, but it's only a mitigation. Light error correction is common and effective, but you can't push it to a full solution. The only way to fully solve this problem is to look at the entire document at once so you have maximum context available, and this is what non-traditional OCR attempts to do.
    replies(1): >>45646597 #
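The "hallo"/"hello" trade-off is easy to reproduce with a generic dictionary-based corrector (a minimal sketch using Python's stdlib, not any particular product's pipeline; the dictionary and cutoff are made up for illustration):

```python
from difflib import get_close_matches

# Toy dictionary standing in for a real lexicon.
DICTIONARY = ["hello", "how", "are", "you"]

def correct(word: str, cutoff: float) -> str:
    """Replace word with its closest dictionary entry above cutoff,
    or leave it unchanged if nothing is close enough."""
    matches = get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# An aggressive cutoff fixes a genuine OCR error...
print(correct("hallo", cutoff=0.6))  # -> hello
# ...but the same rule would also "fix" a legitimate German or Welsh
# greeting, with no way to tell the cases apart at the word level.
# That ambiguity is why light correction plus whole-document context
# beats aggressive word-by-word rewriting.
```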
    10. fluoridation ◴[] No.45646597{4}[source]
    Okay, but there are way more common errors that should be easy to fix: "He11o", "Emest Herningway", incorrect diacritics like the other person mentioned, etc.
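For the digit-for-letter class of errors, a rule-based pass really is simple. Here is a minimal sketch (the substitution table and helper are illustrative, not any library's API):

```python
# Map common OCR confusables back to letters. This table is a tiny
# hand-picked example, not an exhaustive confusion list.
CONFUSABLES = {"0": "o", "1": "l", "5": "s", "rn": "m"}

def fix_in_word(word: str) -> str:
    # Only touch words that contain letters, so standalone numbers
    # like "2024" survive intact.
    if not any(c.isalpha() for c in word):
        return word
    for bad, good in CONFUSABLES.items():
        word = word.replace(bad, good)
    return word

line = "He11o Emest Herningway 2024"
print(" ".join(fix_in_word(w) for w in line.split()))
# -> Hello Emest Hemingway 2024
```

Note that "Emest" stays broken: reversing m -> rn is ambiguous (every "m" could be "rn"), which is exactly where the context problem from the comment above kicks in.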
    11. stopyellingatme ◴[] No.45647313[source]
    Not sure about the others, but we use Azure AI Document Intelligence and it's working well for our resume parsing system. It took a good bit of tuning, but we haven't had to touch it for almost a year now.
    12. cheema33 ◴[] No.45647914[source]
    The Omni OCR team says that, according to their own benchmark, the best OCR is Omni OCR. I am quite surprised.
    13. junto ◴[] No.45667751[source]
    Not sure how it compares, but we did some trials with Azure AI Document Intelligence and were very surprised at how good it was. One example document was a poor photograph with quite a skew, and it (to our surprise) also detected the customer's human-legible signature and extracted their name from it.