Ingesting PDFs and why Gemini 2.0 changes everything

(www.sergey.fyi)

1303 points serjester | 5 comments | 05 Feb 25 18:05 UTC

lazypenguin ◴[05 Feb 25 19:19 UTC] No.42953665[source]▶

I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate a multi-modal, large-context-window model in terms of ease of use. Ironically, this vendor is the best known and most successful vendor for OCR'ing this specific type of PDF, but many of our requests failed over to their human-in-the-loop process. Despite it not being their specialization, switching to Gemini was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6s on average, accuracy was like 96% of that of the vendor, and the price was significantly lower. For the 4% inaccuracies, a lot of them are things like the handwritten text "LLC" getting OCR'd as "IIC", which I would say is somewhat "fair". We probably could improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema", and it didn't require any fancy "prompt engineering" to contort out a result.

The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem with the weirdly high context window. Multi-modal, so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
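For readers who want to picture the "file part plus simple prompt" setup described above, here is a minimal sketch of how such a request body might be assembled. It is not the commenter's code. The schema (`INVOICE_SCHEMA`) and the helper name are hypothetical, and the JSON field names (`inline_data`, `mime_type`, `generationConfig`, `responseMimeType`) follow my recollection of the public Gemini REST request shape, so check them against the current API reference before relying on them. The sketch only builds the payload and does not send anything.

```python
import base64
import json


def build_extraction_request(pdf_bytes: bytes, schema: dict) -> dict:
    """Pair a PDF file part with a schema-driven OCR prompt (assumed request shape)."""
    prompt = (
        "OCR this PDF into this format as specified by this json schema:\n"
        + json.dumps(schema, indent=2)
    )
    return {
        "contents": [
            {
                "parts": [
                    # The PDF travels as a base64-encoded inline file part.
                    {
                        "inline_data": {
                            "mime_type": "application/pdf",
                            "data": base64.b64encode(pdf_bytes).decode("ascii"),
                        }
                    },
                    # The text part carries the prompt with the schema embedded.
                    {"text": prompt},
                ]
            }
        ],
        # Assumption: this setting asks for a JSON response body.
        "generationConfig": {"responseMimeType": "application/json"},
    }


# Hypothetical schema, for illustration only.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["company_name", "total_amount"],
}

body = build_extraction_request(b"%PDF-1.4 placeholder bytes", INVOICE_SCHEMA)
```

The resulting `body` would be posted as JSON to whichever model endpoint is in use; the same function works for scanned and text-based PDFs because the file is passed through untouched.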

replies(33): >>42953680 #>>42953745 #>>42953799 #>>42954088 #>>42954472 #>>42955083 #>>42955470 #>>42955520 #>>42955824 #>>42956650 #>>42956937 #>>42957231 #>>42957551 #>>42957624 #>>42957905 #>>42958152 #>>42958534 #>>42958555 #>>42958869 #>>42959364 #>>42959695 #>>42959887 #>>42960847 #>>42960954 #>>42961030 #>>42961554 #>>42962009 #>>42963981 #>>42964161 #>>42965420 #>>42966080 #>>42989066 #>>43000649 #

kbyatnal ◴[06 Feb 25 00:46 UTC] No.42957551[source]▶

>>42953665 #

This is spot on. Any legacy vendor focusing on a specific type of PDF is going to get obliterated by LLMs. The problem with using an off-the-shelf provider like this is that you get stuck with their data schema. With an LLM, you have full control over the schema, meaning you can parse and extract much more unique data.

The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"

You could improve your accuracy further by adding some chain-of-thought to your prompt btw. e.g. Make each field in your json schema have a `reasoning` field beforehand so the model can CoT how it got to its answer. If you want to take it to the next level, `citations` in our experience also improves performance (and when combined with bounding boxes, is powerful for human-in-the-loop tooling).
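One way to read the `reasoning`-before-each-field suggestion is as a schema transformation. The sketch below is an illustration under assumptions, not the commenter's implementation: the helper name `with_reasoning`, the `_reasoning` and `_citation` suffixes, and the example fields are all hypothetical. It also assumes the model emits keys in the order the schema lists them, which depends on the API in use.

```python
def with_reasoning(fields: dict, include_citations: bool = False) -> dict:
    """Return an object schema where each field is preceded by a reasoning slot."""
    props = {}
    for name, spec in fields.items():
        # Reasoning comes first so the model "thinks" before committing to a value.
        props[f"{name}_reasoning"] = {
            "type": "string",
            "description": f"How the value of '{name}' was derived from the document.",
        }
        if include_citations:
            # Optional verbatim quote from the document supporting the value.
            props[f"{name}_citation"] = {
                "type": "string",
                "description": f"Exact source text supporting '{name}'.",
            }
        props[name] = spec
    return {"type": "object", "properties": props, "required": list(props)}


# Hypothetical fields, for illustration only.
schema = with_reasoning(
    {"company_name": {"type": "string"}, "total_amount": {"type": "number"}},
    include_citations=True,
)
```

Python dictionaries preserve insertion order, so serializing `schema` with `json.dumps` keeps each reasoning slot ahead of the field it explains.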

Disclaimer: I started an LLM doc processing infra company (https://extend.app/)

replies(6): >>42960720 #>>42964598 #>>42971548 #>>42993825 #>>42999533 #>>43081041 #

TeMPOraL ◴[06 Feb 25 09:40 UTC] No.42960720[source]▶

>>42957551 #

> The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"

A smart vendor will shift into that space - they'll use that LLM themselves, and figure out some combination of finetunes, multiple LLMs, classical methods and human verification of random samples, that lets them not only "validate its performance, and deploy it with confidence into prod", but also sell that confidence with an SLA on top of it.
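The "human verification of random samples" piece can be made concrete with a small sketch. This is an assumption about how one might do it, not a description of any vendor's process: draw a random sample of processed documents for human review, then put a confidence interval around the observed accuracy. The function names are hypothetical, and the Wilson score interval is my choice of method.

```python
import math
import random


def sample_for_review(doc_ids, k, seed=None):
    """Pick up to k document IDs uniformly at random for human checking."""
    ids = list(doc_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Example: reviewers find 96 of 100 sampled documents correct.
review_batch = sample_for_review(range(10_000), k=100, seed=0)
low, high = wilson_interval(correct=96, total=100)
```

With 96 correct out of 100 reviewed, the interval runs from roughly 0.90 to 0.98, which gives a sense of how large a sample would be needed before putting an accuracy number into an SLA.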

replies(4): >>42961020 #>>42962159 #>>42965369 #>>43072136 #

sitkack ◴[06 Feb 25 13:38 UTC] No.42962159[source]▶

>>42960720 #

Software is dead. If it isn't a prompt now, it will be a prompt in 6 months.

Most of what we think software is today will just be a UI. But UIs are also dead.

replies(4): >>42962384 #>>42963150 #>>42964153 #>>42965019 #

victorbjorklund ◴[06 Feb 25 16:46 UTC] No.42964153[source]▶

>>42962159 #

Can you prompt a Salesforce replacement for an org with 100 000 employees?

replies(1): >>42966218 #

mrbungie ◴[06 Feb 25 20:33 UTC] No.42966218[source]▶

>>42964153 #

Yesterday I read an /r/singularity post that was in awe because a screenshot of a lead management platform from OAI at a Japan convention supposedly meant a direct threat to Salesforce. Like, yeah sure buddy.

I would say most accelerationists/AI bulls/etc. don't really understand the true essential complexity in software development. LLMs are being seen as a software development silver bullet, and we know what happens with silver bullets.

replies(1): >>42966233 #

sitkack ◴[06 Feb 25 20:35 UTC] No.42966233[source]▶

>>42966218 #

Come back to your comment in 18 months.

replies(1): >>42966381 #

1. collingreen ◴[06 Feb 25 20:51 UTC] No.42966381[source]▶

>>42966233 #

I assume this is a slap intended to imply that AI actually IS a silver bullet answer to the parent's described problem and in just 18 months they will look back and realize how wrong they are.

Is that what you mean and, if so, is there anything in particular you've seen that leads you to see these problems being solved well or on the 18 month timeline? That sounds interesting to look at to me and I'd love to know more.

replies(1): >>42967637 #

2. sitkack ◴[06 Feb 25 23:41 UTC] No.42967637[source]▶

>>42966381 #

It isn't a silver bullet in that it can just "make software" but it is changing the entire dynamic.

You can't do point sampling to figure out where things are going. We have to look at the slope. People see a paper come out, look at the results and say, "this fails for x, y and z. doesn't work". That is not how scientific research works. This is why Two Minute Papers has the tag line, "hold on to your papers ... two papers down the line ..."

Copy and paste the whole thread into a SOTA model and have meta me explain it.

replies(1): >>42980359 #

3. ethbr1 ◴[08 Feb 25 04:23 UTC] No.42980359[source]▶

>>42967637 #

That's not why more experienced people are doubting you.

They're doubting you because the non-digital portions of processes change at people/org speed.

Which is to say that changing a core business process is a year political consensus, rearchitecture, and change management effort, because you also have to coordinate all the cascading and interfacing changes.

replies(1): >>42985568 #

4. sitkack ◴[08 Feb 25 19:38 UTC] No.42985568{3}[source]▶

>>42980359 #

> changing a core business process is a year political consensus, rearchitecture, and change management effort

You are thinking within the existing structures; those structures will evaporate. All along the software supply chain, processes will get upended, not just because of how technical assets will be created, but also because of how organizations themselves are structured and react, and in turn how software is created and consumed.

This is as big as the invention of the corporation, the printing press and the industrial revolution.

I am not here to tutor people on this viewpoint or defend it. I offer it, and everyone can do with it what they will.