Ingesting PDFs and why Gemini 2.0 changes everything

(www.sergey.fyi)

1303 points serjester | 1 comments | 05 Feb 25 18:05 UTC | HN request time: 1.131s | source

Show context

llm_trw ◴[05 Feb 25 21:27 UTC] No.42955414[source]▶

This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.

You then feed each box of text to a regular OCR model, also gives you a confidence score along with each prediction it makes.

You feed each image box into a multimodal model to describe what the image is about.

For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.

You then stitch everything together in an XML file because Markdown is for human consumption.

You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.

You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.

You then get chunking with location data and confidence scores of every part of the document to put as meta data into the RAG store.

I've build a system that read 500k pages _per day_ using the above completely locally on a machine that cost $20k.

replies(17): >>42955515 #>>42956087 #>>42956247 #>>42956265 #>>42956619 #>>42957414 #>>42958781 #>>42958962 #>>42959394 #>>42960744 #>>42960927 #>>42961296 #>>42961613 #>>42962243 #>>42962387 #>>42965540 #>>42983927 #

vikp ◴[05 Feb 25 23:01 UTC] No.42956619[source]▶

>>42955414 #

Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields, but can all render to html. It can write out to json, html, or markdown.

I integrated gemini recently to improve accuracy in certain blocks like tables. (get initial text, then pass to gemini to refine) Marker alone works about as well as gemini alone, but together they benchmark much better.

replies(4): >>42957047 #>>42960378 #>>42960463 #>>42960648 #

1. llm_trw ◴[05 Feb 25 23:46 UTC] No.42957047[source]▶

>>42956619 #

I used sxml [0] unironically in this project extensively.

The rendering step for reports that humans got to see was a call to pandoc after the sxml was rendered to markdown - look ma we support powerpoint! - but it also allowed us to easily convert to whatever insane markup a given large (or small) language model worked best with on the fly.

[0] https://en.wikipedia.org/wiki/SXML

↑