
llm_trw ◴[] No.42955414[source]
This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.
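
A minimal sketch of that step, assuming an ultralytics YOLO detector fine-tuned on document layouts (the weights file name and class labels are placeholders, not a specific released model):

    # Layout detection: one pass over the page image yields typed regions,
    # each with a bounding box and a confidence score.
    from ultralytics import YOLO

    model = YOLO("layout_yolo.pt")        # placeholder: weights fine-tuned on document layouts
    result = model("page_001.png")[0]

    regions = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        regions.append({
            "type": result.names[int(box.cls)],  # e.g. "text", "table", "figure"
            "bbox": (x1, y1, x2, y2),
            "confidence": float(box.conf),       # the "for free" confidence score
        })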

You then feed each text box to a regular OCR model, which also gives you a confidence score along with each prediction it makes.
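
For example, with Tesseract via pytesseract (word-level confidences come back from image_to_data; the regions are the ones detected above):

    # Per-region OCR with word-level confidence scores.
    import pytesseract
    from PIL import Image

    page = Image.open("page_001.png")
    for region in regions:
        if region["type"] != "text":
            continue
        crop = page.crop(tuple(int(v) for v in region["bbox"]))
        data = pytesseract.image_to_data(crop, output_type=pytesseract.Output.DICT)
        region["words"] = [
            {"text": w, "conf": float(c)}
            for w, c in zip(data["text"], data["conf"])
            if w.strip() and float(c) >= 0       # conf == -1 marks non-word rows
        ]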

You feed each image box into a multimodal model to describe what the image is about.
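
The comment doesn't name a model for this, so as one example a local captioning model such as BLIP can stand in for "a multimodal model" here:

    # Describe figure regions with a local captioning model.
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    for region in regions:
        if region["type"] != "figure":
            continue
        crop = page.crop(tuple(int(v) for v in region["bbox"]))
        inputs = processor(images=crop, return_tensors="pt")
        out = captioner.generate(**inputs, max_new_tokens=40)
        region["description"] = processor.decode(out[0], skip_special_tokens=True)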

For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.
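
GridFormer doesn't ship as a ready-made package as far as I know, so here's a sketch with Microsoft's Table Transformer as a stand-in table-specialist model; its row/column/cell detections come back with their own scores:

    # Table structure recognition on the table regions only.
    import torch
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    proc = AutoImageProcessor.from_pretrained("microsoft/table-transformer-structure-recognition")
    table_model = TableTransformerForObjectDetection.from_pretrained(
        "microsoft/table-transformer-structure-recognition"
    )

    for region in regions:
        if region["type"] != "table":
            continue
        crop = page.crop(tuple(int(v) for v in region["bbox"]))
        inputs = proc(images=crop, return_tensors="pt")
        with torch.no_grad():
            outputs = table_model(**inputs)
        region["table_structure"] = proc.post_process_object_detection(
            outputs, threshold=0.6, target_sizes=[crop.size[::-1]]
        )[0]  # row/column/cell boxes, labels and scores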

You then stitch everything together in an XML file because Markdown is for human consumption.

You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.
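
A sketch of the stitching step with the standard library, assuming the region dicts built above (element and attribute names are illustrative, not a fixed schema):

    # Flat XML: one element per region, confidence metadata as attributes.
    import xml.etree.ElementTree as ET

    doc = ET.Element("document", source="page_001.png")
    for region in regions:
        el = ET.SubElement(
            doc, region["type"],
            bbox=",".join(f"{v:.0f}" for v in region["bbox"]),
            det_conf=f'{region["confidence"]:.3f}',
        )
        if region.get("words"):
            el.text = " ".join(w["text"] for w in region["words"])
            el.set("min_ocr_conf", f'{min(w["conf"] for w in region["words"]):.1f}')
        elif region.get("description"):
            el.text = region["description"]

    ET.ElementTree(doc).write("page_001.xml", encoding="utf-8")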

You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.

You then get chunks, with location data and confidence scores for every part of the document, to put as metadata into the RAG store.
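
A sketch of those last two steps, assuming the XML produced above: the markup decides which regions are worth sending to the LLM, and the location/confidence data rides along as chunk metadata.

    # Select regions via the XML and emit chunks with metadata for the RAG store.
    import xml.etree.ElementTree as ET

    root = ET.parse("page_001.xml").getroot()
    chunks = []
    for el in root:
        if el.tag == "text" and el.text and float(el.get("det_conf", "0")) >= 0.5:
            chunks.append({
                "text": el.text,
                "metadata": {
                    "source": root.get("source"),
                    "region_type": el.tag,
                    "bbox": el.get("bbox"),
                    "det_conf": float(el.get("det_conf")),
                    "min_ocr_conf": float(el.get("min_ocr_conf", "-1")),
                },
            })
    # Low-confidence regions can be routed to an LLM pass or human review instead.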

I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.

replies(17): >>42955515 #>>42956087 #>>42956247 #>>42956265 #>>42956619 #>>42957414 #>>42958781 #>>42958962 #>>42959394 #>>42960744 #>>42960927 #>>42961296 #>>42961613 #>>42962243 #>>42962387 #>>42965540 #>>42983927 #
ajcp ◴[] No.42956087[source]
Not sure what service you're basing your calculation on, but with Gemini I've processed 10,000,000+ shipping documents (PDFs and PNGs) of every conceivable layout in one month, at under $1,000 and an accuracy rate between 80% and 82% (humans were at 66%).

The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database
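
A minimal sketch of that pipeline using the google-generativeai SDK's file API (the poster's setup with a storage bucket presumably goes through Vertex AI instead; the prompt and field names are made up for illustration):

    # PDF in, JSON record out, ready to insert into the database.
    import json
    import google.generativeai as genai

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-2.0-flash")

    uploaded = genai.upload_file("shipping_doc_0001.pdf")
    response = model.generate_content([
        uploaded,
        "Extract shipper, consignee, container numbers and dates. "
        "Reply with a single JSON object only.",
    ])
    record = json.loads(response.text)   # may need JSON mode / cleanup in practice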

Just to get sick with it, we actually added some recursion to the Gemini step to have it rate how well it extracted, and if it was below a certain rating, to rewrite its own instructions on how to extract the information and then feed it back into itself. We didn't see any improvement in accuracy, but it was still fun to do.
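
Roughly what that loop might look like, reusing the model and uploaded file from the sketch above (the prompts, the 1-10 scale and the retry count are illustrative; as noted, it didn't move accuracy):

    # Self-rating loop: extract, rate, and let the model rewrite its own instructions.
    instructions = "Extract shipper, consignee, container numbers and dates as JSON."
    for attempt in range(3):
        extraction = model.generate_content([uploaded, instructions]).text
        rating = model.generate_content([
            uploaded,
            f"Rate this extraction from 1 to 10:\n{extraction}\nReply with a number only.",
        ]).text
        if int(rating.strip()) >= 8:
            break
        instructions = model.generate_content([
            f"These extraction instructions scored {rating.strip()}/10:\n{instructions}\n"
            "Rewrite them so the extraction is more reliable."
        ]).text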

replies(3): >>42956215 #>>42956929 #>>42965774 #
llm_trw ◴[] No.42956929[source]
>Not sure what service you're basing your calculation on, but with Gemini

The table of costs in the blog post. At 500,000 pages per day, the fixed hardware cost overtakes the variable software cost around day 240, and from then on you're paying an extra ~$100 per day to keep it running in the cloud (rough arithmetic after the table). The machine also had to use extremely beefy GPUs to fit all the models it needed. Compute utilization was between 5% and 10%, which means it's future-proof for the next 5 years at the rate at which the data source was growing.

    | Model                       | Pages per Dollar |
    |-----------------------------+------------------|
    | Gemini 2.0 Flash            | ≈ 6,000          |
    | Gemini 2.0 Flash Lite       | ≈ 12,000*        |
    | Gemini 1.5 Flash            | ≈ 10,000         |
    | AWS Textract                | ≈ 1,000          |
    | Gemini 1.5 Pro              | ≈ 700            |
    | OpenAI 4o-mini              | ≈ 450            |
    | LlamaParse                  | ≈ 300            |
    | OpenAI 4o                   | ≈ 200            |
    | Anthropic claude-3-5-sonnet | ≈ 100            |
    | Reducto                     | ≈ 100            |
    | Chunkr                      | ≈ 100            |
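
The arithmetic behind the break-even, taking the Gemini 2.0 Flash row (≈6,000 pages per dollar) as the cloud comparison and the $20k machine / 500k pages per day figures as stated:

    pages_per_day = 500_000
    pages_per_dollar = 6_000          # Gemini 2.0 Flash row above
    hardware_cost = 20_000

    cloud_cost_per_day = pages_per_day / pages_per_dollar   # ~$83/day, i.e. roughly ~$100/day
    break_even_days = hardware_cost / cloud_cost_per_day    # ~240 days
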
There is also the fact that it's _completely_ local, which meant we could throw every proprietary data source that couldn't leave the company at it.

>The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database

Each company should build tools that match the skill level of its developers. If you're not comfortable training models locally, with all that entails, off-the-shelf solutions let companies punch way above their weight class in their industry.

replies(1): >>42957126 #
serjester ◴[] No.42957126[source]
That assumes that you're able to find a model that can match Gemini's performance - I haven't come across anything that comes close (although hopefully that changes).
replies(1): >>42962443 #
jeswin ◴[] No.42962443[source]
Nice article; it mirrors my experience. Last year (around when multimodal 3.5 Sonnet launched), I ran a sizeable number of PDFs through it. Accuracy was remarkably high (99%-ish), whereas GPT was just unusable for this purpose.