    1303 points serjester | 12 comments
    lazypenguin ◴[] No.42953665[source]
    I work in fintech and we replaced an OCR vendor with Gemini for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate how much ease of use a multi-modal model with a large context window buys you. Ironically, this vendor is the best-known and most successful vendor for OCR'ing this specific type of PDF, yet many of our requests still failed over to their human-in-the-loop process. Despite OCR not being Gemini's specialization, switching was a no-brainer after our testing.

    Processing time went from something like 12 minutes on average to 6 seconds on average, accuracy was about 96% of the vendor's, and the price was significantly lower. A lot of the 4% inaccuracies are things like handwritten "LLC" getting OCR'd as "IIC", which I would say is somewhat "fair". We could probably improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require any fancy "prompt engineering" to contort out a result.

    The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. The weirdly large context window makes it easy to focus on the main problem. It's multi-modal, so it handles a lot of issues for you (PDF-as-image vs. PDF with embedded text), etc. I can recommend it for the use case presented in this blog (ignoring the bounding-boxes part)!
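    The workflow described above can be sketched with Google's google-genai Python SDK: attach the PDF as a file "part" and constrain the output with a JSON schema. The schema fields and model name here are illustrative assumptions, not details from the comment.

```python
# Hypothetical sketch of the PDF-to-JSON workflow, assuming the
# google-genai SDK (pip install google-genai). The schema below is an
# illustrative invoice schema, not the commenter's actual schema.
import json

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}


def extract_pdf(pdf_bytes: bytes, api_key: str) -> dict:
    """Send the PDF as an inline part and ask for schema-constrained JSON."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # model choice is an assumption
        contents=[
            types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
            "OCR this PDF into this format as specified by this json schema",
        ],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=INVOICE_SCHEMA,
        ),
    )
    return json.loads(response.text)
```

    With `response_schema` set, the model is constrained to emit JSON matching the schema, which is roughly what the "OCR this PDF into this format" prompt relies on.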

    replies(33): >>42953680 #>>42953745 #>>42953799 #>>42954088 #>>42954472 #>>42955083 #>>42955470 #>>42955520 #>>42955824 #>>42956650 #>>42956937 #>>42957231 #>>42957551 #>>42957624 #>>42957905 #>>42958152 #>>42958534 #>>42958555 #>>42958869 #>>42959364 #>>42959695 #>>42959887 #>>42960847 #>>42960954 #>>42961030 #>>42961554 #>>42962009 #>>42963981 #>>42964161 #>>42965420 #>>42966080 #>>42989066 #>>43000649 #
    1. yzydserd ◴[] No.42954088[source]
    How do today’s LLMs like Gemini compare with the Document Understanding services Google/AWS/Azure have offered for a few years, particularly when dealing with known forms? I think Google’s is Document AI.
    replies(3): >>42954334 #>>42955867 #>>42956923 #
    2. zacmps ◴[] No.42954334[source]
    I've found the highest-accuracy solution is to OCR with one of the dedicated models, then feed that text and the original image into an LLM with a prompt like:

    "Correct errors in this OCR transcription".

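    The two-pass approach described above can be sketched as follows: a dedicated OCR engine produces a first transcription, then the LLM sees both that text and the original image and fixes the transcription errors. pytesseract and the model name are illustrative assumptions; any dedicated OCR engine would fill the same role.

```python
# Sketch of the OCR-then-correct pipeline, assuming pytesseract + pillow
# for the first pass and the google-genai SDK for the correction pass.
def build_correction_prompt(raw_text: str) -> str:
    # The prompt from the comment, with the first-pass OCR output appended.
    return "Correct errors in this OCR transcription:\n\n" + raw_text


def correct_ocr(image_bytes: bytes, api_key: str) -> str:
    import io

    from PIL import Image       # pillow, to decode the page image
    import pytesseract          # dedicated OCR engine (illustrative choice)
    from google import genai
    from google.genai import types

    # Pass 1: dedicated OCR model produces a raw transcription.
    raw_text = pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))

    # Pass 2: the LLM sees both the image and the transcription.
    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            build_correction_prompt(raw_text),
        ],
    )
    return response.text
```

    The appeal of this split is that the dedicated engine handles layout and character recognition, while the LLM only has to resolve ambiguities (e.g. "IIC" vs. "LLC") with the image as ground truth.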
    replies(2): >>42954503 #>>42954985 #
    3. bradfox2 ◴[] No.42954503[source]
    This is what we do today. Have you tried it against Gemini 2.0?
    4. therein ◴[] No.42954985[source]
    How does it behave if the body of text is offensive or what if it is talking about a recipe to purify UF-6 gas at home? Will it stop doing what it is doing and enter lecturing mode?

    I am asking not to be cynical, but because in my limited experience, using LLMs for any task that may operate on offensive or unknown input seems to trigger all sorts of unpredictable moral judgements and drag the model into generating anything but the output I wanted.

    If I am asking this black box to give me a JSON output containing keywords for a certain text, if it happens to be offensive, it refuses to do that.

    How does one tackle that?

    replies(4): >>42955472 #>>42955555 #>>42955905 #>>42959036 #
    5. zacmps ◴[] No.42955472{3}[source]
    It's not something I've needed to deal with personally.

    We have run into added content filters in Azure OpenAI on a different application, but we just put in a request to tune them down for us.

    6. xnx ◴[] No.42955555{3}[source]
    There are many settings for changing the safety level in Gemini API calls: https://ai.google.dev/gemini-api/docs/safety-settings
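    Using the settings documented at that link, the filters can be relaxed per request. A minimal sketch with the google-genai SDK, assuming the four harm categories listed in the public docs (note that comment 12 below reports the newest SDKs expect the threshold string 'OFF' rather than 'BLOCK_NONE'):

```python
# Hedged example of relaxing Gemini's safety filters per request via the
# google-genai SDK. Category names are from the public safety-settings docs.
RELAXED_SAFETY = [
    {"category": c, "threshold": "BLOCK_NONE"}
    for c in (
        "HARM_CATEGORY_HARASSMENT",
        "HARM_CATEGORY_HATE_SPEECH",
        "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "HARM_CATEGORY_DANGEROUS_CONTENT",
    )
]


def generate_with_relaxed_safety(prompt: str, api_key: str) -> str:
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            safety_settings=[types.SafetySetting(**s) for s in RELAXED_SAFETY],
        ),
    )
    return response.text
```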
    replies(1): >>42961586 #
    7. ajcp ◴[] No.42955867[source]
    GCP's Document AI service is now literally just a UI layer for document-parsing use cases, backed by Gemini models. When we realized that, we dumped it and just use Gemini directly.
    8. sumedh ◴[] No.42955905{3}[source]
    Try setting the safety params to none and see if that makes any difference.
    9. anirudhb99 ◴[] No.42956923[source]
    Member of the Gemini team here -- personally, I'd recommend using Gemini directly vs. the document understanding services for OCR & general doc-understanding tasks. From our internal evals, Gemini is now stronger than these solutions and is only going to get much better (higher precision, lower hallucination rates) from here.
    replies(1): >>42958486 #
    10. joelhaus ◴[] No.42958486[source]
    Could we connect offline about using Gemini instead of the doc ai custom extractor we currently use in production?

    This sounds amazing & I'd love your input on our specific use case.

    joelatoutboundin.com

    11. devjab ◴[] No.42959036{3}[source]
    We use the Azure models and there isn't an issue with safety filters as such for enterprise customers. The one time we had an issue, Microsoft changed the safety measures for us. Of course, the safety measures we might hit are the sort of engineering that could be interpreted as weapons manufacturing, not "political" content as such. Basically, the safety guard rails seem to be added on top of all these models, which means they can also be removed without impacting the model. I could be wrong on that, but it seems that way.
    12. shijithpk ◴[] No.42961586{4}[source]
    This is for anyone coming across this link later. In the latest SDKs, if you want to completely switch off the safety settings, the flag to use is 'OFF', not 'BLOCK_NONE' as mentioned in the docs linked above.

    The Gemini docs don't reflect that change yet. https://discuss.ai.google.dev/t/safety-settings-2025-update-...