
1303 points by serjester | 5 comments
llm_trw ◴[] No.42955414[source]
This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.
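
A minimal sketch of this layout-detection step, assuming layoutparser with a PubLayNet-trained Detectron2 model (my choice of library and weights, not necessarily what the commenter used):

```python
import cv2
import layoutparser as lp

# PubLayNet-trained detector: finds Text / Title / List / Table / Figure regions
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

image = cv2.imread("page.png")[..., ::-1]  # BGR -> RGB
layout = model.detect(image)

for i, block in enumerate(layout):
    # each block carries a label, coordinates, and a confidence score "for free"
    print(block.type, round(block.score, 3), block.coordinates)
    crop = block.crop_image(image)  # hand each crop to the next stage
    cv2.imwrite(f"{block.type.lower()}_block_{i}.png", crop[..., ::-1])
```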

You then feed each box of text to a regular OCR model, which also gives you a confidence score along with each prediction it makes.
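
For the per-box OCR pass, something like Tesseract already exposes per-word confidences; a sketch with pytesseract, reading one of the crops saved above:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

crop = Image.open("text_block_0.png")  # one text region from the layout step
data = pytesseract.image_to_data(crop, output_type=Output.DICT)

words = [
    {"text": w, "conf": float(c)}
    for w, c in zip(data["text"], data["conf"])
    if w.strip() and float(c) >= 0  # Tesseract reports -1 for non-word boxes
]
text = " ".join(w["text"] for w in words)
needs_review = [w for w in words if w["conf"] < 60]  # arbitrary review threshold
```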

You feed each image box into a multimodal model to describe what the image is about.
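
As a small local stand-in for "describe what the image is about", a captioning model like BLIP works; this sketch assumes the figure crops produced by the layout step:

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

figure = Image.open("figure_block_0.png").convert("RGB")
inputs = processor(figure, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
caption = processor.decode(out[0], skip_special_tokens=True)  # e.g. "a bar chart of ..."
```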

For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.
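
GridFormer itself doesn't ship as a pip package I can point to, so here is a sketch with Microsoft's Table Transformer standing in as the table-only specialist; swap in whichever table model you actually use:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

name = "microsoft/table-transformer-structure-recognition"
processor = AutoImageProcessor.from_pretrained(name)
model = TableTransformerForObjectDetection.from_pretrained(name)

table_img = Image.open("table_block_0.png").convert("RGB")
inputs = processor(images=table_img, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# rows / columns / header cells, each with its own confidence score
sizes = torch.tensor([table_img.size[::-1]])
result = processor.post_process_object_detection(outputs, threshold=0.7, target_sizes=sizes)[0]
for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```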

You then stitch everything together in an XML file because Markdown is for human consumption.

You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.
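
A sketch of what that flat XML might look like once stitched together; the element and attribute names here are mine, not a standard schema, and `blocks` stands for the merged output of the previous steps:

```python
import xml.etree.ElementTree as ET

# `blocks` is assumed to hold the combined results of the steps above, e.g.:
# {"type": "Text", "bbox": (34, 80, 560, 240), "det_conf": 0.98,
#  "ocr_conf": 0.93, "content": "..."}
page = ET.Element("page", number="1")
for b in blocks:
    el = ET.SubElement(
        page,
        b["type"].lower(),  # text / title / list / table / figure
        bbox=",".join(str(int(v)) for v in b["bbox"]),
        det_conf=f'{b["det_conf"]:.3f}',
        ocr_conf=f'{b.get("ocr_conf", 1.0):.3f}',
    )
    el.text = b["content"]

ET.ElementTree(page).write("page_0001.xml", encoding="utf-8", xml_declaration=True)
```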

You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.
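
For instance, pulling only high-confidence prose sections out of that XML before prompting; `call_llm` is a placeholder for whatever client or local model you use:

```python
import xml.etree.ElementTree as ET

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client / local model here")

page = ET.parse("page_0001.xml").getroot()
sections = [
    el.text for el in page
    if el.tag in ("text", "title") and float(el.get("ocr_conf", "0")) > 0.9
]
summary = call_llm(
    "Summarise the following document sections:\n\n" + "\n\n".join(sections)
)
```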

You then get chunks with location data and confidence scores for every part of the document to put as metadata into the RAG store.
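
What those chunk records might look like before they go into the store (the field names are illustrative):

```python
chunks = []
for el in page:  # the parsed <page> element from the previous step
    if not el.text:
        continue
    chunks.append({
        "text": el.text,
        "metadata": {
            "doc_id": "doc_0001",
            "page": page.get("number"),
            "bbox": el.get("bbox"),
            "block_type": el.tag,
            "det_conf": float(el.get("det_conf")),
            "ocr_conf": float(el.get("ocr_conf")),
        },
    })
# embed chunk["text"] and upsert (vector, metadata) into whatever RAG store you use
```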

I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.

replies(17): >>42955515 #>>42956087 #>>42956247 #>>42956265 #>>42956619 #>>42957414 #>>42958781 #>>42958962 #>>42959394 #>>42960744 #>>42960927 #>>42961296 #>>42961613 #>>42962243 #>>42962387 #>>42965540 #>>42983927 #
1. jeswin ◴[] No.42962387[source]
I feel compelled to reply. You've made a bunch of assumptions, and presented your success (likely with a limited set of table formats) as the one true way to parse PDFs. There's no such thing.

In real world usage, many tables are badly misaligned. Headers are off. Lines are missing between rows. Some columns and rows are separated by colors. Cells are merged. Some are imported from Excel. There are dotted subsections, tables inside cells, etc. Claude (and now Gemini) can parse complex tables and convert them to meaningful data. Your solution will likely fail, because the rules are fuzzy in the same way written language is fuzzy.
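
A sketch of the approach being described here, using the google-generativeai client; the model name, prompt, and output schema are illustrative, not a recommendation:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")

page = Image.open("messy_table_page.png")
prompt = (
    "Extract every table on this page as JSON shaped like "
    '{"tables": [{"headers": [...], "rows": [[...], ...]}]}. '
    "Repeat values for merged cells. Return only JSON."
)
response = model.generate_content([prompt, page])
print(response.text)
```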

Recently someone posted this on HN; it's a good read: https://lukaspetersson.com/blog/2025/bitter-vertical/

> You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

No, not like that, but often as nested JSON or XML. For financial documents, our accuracy was above 99%. There are many ways to do error checking to figure out which ones are likely to have errors.
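
They don't say which checks they used; one common one for financial documents is arithmetic cross-validation of the extracted output, roughly (the record structure below is illustrative):

```python
def flag_suspect_invoices(invoices, tolerance=0.01):
    """Flag extractions whose line items don't add up to the stated total."""
    suspect = []
    for inv in invoices:  # e.g. parsed from the model's JSON output
        line_sum = sum(item["amount"] for item in inv["line_items"])
        if abs(line_sum - inv["total"]) > tolerance:
            suspect.append(inv["id"])
    return suspect  # route these to a second pass or human review
```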

> This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

One should refrain from making statements about cost without knowing how and where it'll be used. When processing millions of PDFs, it could be a problem. When processing 1,000, one might prefer Gemini or another model over spending engineering time. There are many apps where processing a single doc brings in, say, $10 of revenue. You don't care about OCR costs there.

> I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.

The author presented techniques which worked for them. It may not work for you, because there's no one-size-fits-all for these kinds of problems.

replies(2): >>42964167 #>>42967331 #
2. metadat ◴[] No.42964167[source]
Related discussion:

AI founders will learn the bitter lesson

https://news.ycombinator.com/item?id=42672790 - 25 days ago, 263 comments

The HN discussion contains a lot of interesting ideas, thanks for the pointer!

3. llm_trw ◴[] No.42967331[source]
You're making an even less charitable set of assumptions:

1). I'm incompetent enough to ignore publicly available table benchmarks.

2). I'm incompetent enough to never look at poor quality data.

3). I'm incompetent enough to not create a validation dataset for all models that were available.

Needless to say you're wrong on all three.

My rate is $400 + taxes per hour if you want to be walked through each point and why VLMs like Gemini fail spectacularly and unpredictably when left to their own devices.

replies(2): >>42967379 #>>43031954 #
4. pkkkzip ◴[] No.42967379[source]
Whoa, this is a really aggressive response. No one is calling you incompetent; they're criticizing your assumptions.

> My day rate is $400 + taxes per hour if you want to be run through each point

Great, thanks for sharing.

5. danielparsons ◴[] No.43031954[source]
bragging about billing $400 an hour LOL