Ingesting PDFs and why Gemini 2.0 changes everything

(www.sergey.fyi)

1303 points serjester | 2 comments | 05 Feb 25 18:05 UTC

lazypenguin ◴[05 Feb 25 19:19 UTC] No.42953665[source]▶

I work in fintech, and we replaced an OCR vendor with Gemini for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate a multi-modal, large-context-window model in terms of ease of use. Ironically, this vendor is the best-known and most successful vendor for OCR'ing this specific type of PDF, but many of our requests failed over to their human-in-the-loop process. Despite it not being their specialization, switching to Gemini was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6s on average, accuracy was like 96% of that of the vendor, and price was significantly cheaper. For the 4% inaccuracies, a lot of them are things like the handwritten text "LLC" getting OCR'd as "IIC", which I would say is somewhat "fair". We could probably improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema", and it didn't require any fancy "prompt engineering" to contort out a result.

The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem with the weirdly high context window. Multi-modal, so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
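
For anyone wondering what "a file part plus a JSON schema" might look like in practice, here is a minimal sketch. It is not the actual code described above. It only builds the JSON body for a `generateContent` request and makes no network call. The field names (`inlineData`, `mimeType`, `responseMimeType`, `responseSchema`) follow my understanding of the public Gemini REST API and should be checked against the current docs. The helper `build_ocr_request` and the schema fields `company_name` and `total_amount` are hypothetical, made up for illustration.

```python
import base64
import json


def build_ocr_request(pdf_bytes: bytes, schema: dict) -> dict:
    """Build a request body with one PDF file part and one text part.

    Hypothetical helper: it performs no I/O and only assembles a dict.
    """
    return {
        "contents": [
            {
                "parts": [
                    # The PDF is sent inline as base64 text.
                    {
                        "inlineData": {
                            "mimeType": "application/pdf",
                            "data": base64.b64encode(pdf_bytes).decode("ascii"),
                        }
                    },
                    # The prompt quoted in the comment, unchanged.
                    {
                        "text": "OCR this PDF into this format as "
                                "specified by this json schema"
                    },
                ]
            }
        ],
        # Assumption: structured output is requested via these two fields.
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": schema,
        },
    }


# Hypothetical schema; a real one would mirror the document being ingested.
EXAMPLE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "company_name": {"type": "STRING"},
        "total_amount": {"type": "NUMBER"},
    },
    "required": ["company_name"],
}

if __name__ == "__main__":
    body = build_ocr_request(b"%PDF-1.4 placeholder bytes", EXAMPLE_SCHEMA)
    # Show only the keys; the base64 payload is not interesting to read.
    print(json.dumps(list(body.keys())))
```

The point of the sketch is how little there is: the file and the instruction are just two parts in the same message, and the schema rides along in the generation config rather than being squeezed into the prompt.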

replies(33): >>42953680 #>>42953745 #>>42953799 #>>42954088 #>>42954472 #>>42955083 #>>42955470 #>>42955520 #>>42955824 #>>42956650 #>>42956937 #>>42957231 #>>42957551 #>>42957624 #>>42957905 #>>42958152 #>>42958534 #>>42958555 #>>42958869 #>>42959364 #>>42959695 #>>42959887 #>>42960847 #>>42960954 #>>42961030 #>>42961554 #>>42962009 #>>42963981 #>>42964161 #>>42965420 #>>42966080 #>>42989066 #>>43000649 #

j_timberlake ◴[06 Feb 25 00:07 UTC] No.42957231[source]▶

>>42953665 #

This sounds a lot like my old tax accounting job. OCR existed and "worked", but it was faster to just enter the numbers manually than to fix all the errors.

Also, the real solution to the problem should have been for the IRS to just pre-fill tax returns with all the accounting data that they obviously already have. But that would require the government to care.

replies(2): >>42957382 #>>42959469 #

Andrex ◴[06 Feb 25 00:24 UTC] No.42957382[source]▶

>>42957231 #

They finally made filing free.

So, maybe this century?

replies(1): >>42958476 #

kennyloginz ◴[06 Feb 25 02:59 UTC] No.42958476[source]▶

>>42957382 #

Check again, Elon and his Doge team killed that.

replies(1): >>42959610 #

1. happyopossum ◴[06 Feb 25 06:20 UTC] No.42959610[source]▶

>>42958476 #

No they didn’t, that claim is ridiculously easy to debunk but it has been going around because it fits the narrative.

replies(1): >>42965488 #

2. djeastm ◴[06 Feb 25 19:12 UTC] No.42965488[source]▶

>>42959610 (TP) #

It'd be nicer if you wouldn't presume to know the reasons people might believe erroneous information.

In this case, the reason for the misinformation is due to the lack of communication from the DOGE entity regarding their actions. Mr. Musk wrote via Tweet that he had "deleted" the digital services agency "18F" that develops the IRS Free File program and also deleted their X account.

https://apnews.com/article/irs-direct-file-musk-18f-6a4dc35a...

If indeed he did cut the agency, it remains to be seen how long the application will be operational.
