1303 points serjester | 1 comments

llm_trw ◴[] No.42955414[source]
This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.
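
A minimal sketch of that detection step, assuming layoutparser with a PubLayNet Detectron2 checkpoint (the model path and label map follow the layoutparser examples; any detector that returns a label, bounding box and score per region slots in the same way):

    # Layout detection: one labelled bounding box per document section,
    # each with a confidence score we keep as metadata.
    import layoutparser as lp
    import numpy as np
    from PIL import Image

    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    )

    page = Image.open("page_001.png")
    layout = model.detect(np.asarray(page))

    regions = []
    for block in layout:
        x1, y1, x2, y2 = map(int, block.coordinates)
        regions.append({
            "type": block.type,                    # Text / Title / List / Table / Figure
            "bbox": (x1, y1, x2, y2),
            "score": block.score,                  # detection confidence
            "image": page.crop((x1, y1, x2, y2)),  # the section as an image
        })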

You then feed each box of text to a regular OCR model, which also gives you a confidence score along with each prediction it makes.
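
A sketch of that step with Tesseract via pytesseract, which reports a confidence value for every word it emits (any OCR engine with per-token confidence fits the same slot):

    # OCR one cropped text region, keeping the per-word confidence.
    import pytesseract
    from pytesseract import Output

    def ocr_region(region_image):
        data = pytesseract.image_to_data(region_image, output_type=Output.DICT)
        words = []
        for text, conf in zip(data["text"], data["conf"]):
            if text.strip() and float(conf) >= 0:   # conf == -1 marks non-word rows
                words.append({"text": text, "conf": float(conf)})
        return words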

You feed each image box into a multimodal model to describe what the image is about.
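
A sketch of that step, assuming the multimodal model sits behind an OpenAI-compatible endpoint (a local vLLM or llama.cpp server, say); the base_url and model name below are placeholders:

    # Ask a locally hosted vision-language model to describe a figure crop.
    import base64, io
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    def describe_figure(crop):
        buf = io.BytesIO()
        crop.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        resp = client.chat.completions.create(
            model="local-vlm",  # placeholder model name
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this figure in one short paragraph."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content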

For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.

You then stitch everything together in an XML file because Markdown is for human consumption.

You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.
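
A sketch of the stitching step with the standard library's ElementTree; the element and attribute names are illustrative rather than a fixed schema, but they show where the per-box and per-word confidence metadata ends up:

    # Stitch detected regions into flat XML with confidence attributes.
    import xml.etree.ElementTree as ET

    def page_to_xml(page_number, regions):
        page = ET.Element("page", number=str(page_number))
        for r in regions:
            el = ET.SubElement(page, r["type"].lower(),
                               bbox=",".join(str(v) for v in r["bbox"]),
                               det_conf=f'{r["score"]:.3f}')
            if r["type"] in ("Text", "Title", "List"):
                for w in r["words"]:               # from the OCR step
                    ET.SubElement(el, "w", conf=str(w["conf"])).text = w["text"]
            elif r["type"] == "Figure":
                el.text = r["description"]         # from the multimodal step
            # cells from the table model would get the same treatment
        return page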

You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.
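
For example, pulling only body text whose OCR confidence clears an (arbitrary) threshold before prompting the LLM:

    # Select what to send to the LLM by filtering the XML on confidence.
    def text_for_llm(page_xml, min_conf=80):
        parts = []
        for el in page_xml.findall("text") + page_xml.findall("list"):
            words = [w.text for w in el.findall("w") if float(w.get("conf")) >= min_conf]
            if words:
                parts.append(" ".join(words))
        return "\n\n".join(parts)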

You then get chunking with location data and confidence scores for every part of the document to put as metadata into the RAG store.
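
Continuing the sketch, each region becomes one chunk, and the location and confidence data ride along as metadata into whatever vector store you use:

    # One chunk per region, with location and confidence metadata for the RAG store.
    def to_chunks(doc_id, page_number, page_xml):
        chunks = []
        for el in page_xml:
            text = " ".join(w.text for w in el.findall("w")) or (el.text or "")
            chunks.append({
                "text": text,
                "metadata": {
                    "doc_id": doc_id,
                    "page": page_number,
                    "region_type": el.tag,
                    "bbox": el.get("bbox"),
                    "det_conf": float(el.get("det_conf")),
                },
            })
        return chunks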

I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.

replies(17): >>42955515 #>>42956087 #>>42956247 #>>42956265 #>>42956619 #>>42957414 #>>42958781 #>>42958962 #>>42959394 #>>42960744 #>>42960927 #>>42961296 #>>42961613 #>>42962243 #>>42962387 #>>42965540 #>>42983927 #
woah ◴[] No.42957414[source]
Getting "bitter lesson" vibes from this post
replies(1): >>42958009 #
llm_trw ◴[] No.42958009[source]
The bitter lesson says very little of the sort.

If we had unlimited memory, compute and data we'd use a rank N tensor for an input of length N and call it a day.

Unfortunately N^N grows rather fast and we have to do all sorts of interesting engineering to make ML calculations complete before the heat death of the universe.

replies(2): >>42958128 #>>42958874 #
woah ◴[] No.42958874[source]
> Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.
replies(1): >>42959570 #
llm_trw ◴[] No.42959570[source]
To solve MNIST without mathematical tricks like convolutions or attention heads you would need 2.5e42 weights. Assuming you're using 16-bit weights, that's 5e42 bytes. A yottabyte is 1e24 bytes.

That is, you'd need 5 exa-yottabytes to solve it.

Currently the whole world has around 200 zettabytes of storage.
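
Taking those figures at face value, the arithmetic checks out:

    weights = 2.5e42                   # dense weights claimed for "no tricks" MNIST
    bytes_needed = weights * 2         # 16-bit weights -> 2 bytes each = 5e42 bytes
    yottabyte = 1e24                   # bytes
    print(bytes_needed / yottabyte)    # 5e18 YB, i.e. 5 exa-yottabytes
    print(bytes_needed / 200e21)       # vs ~200 ZB worldwide: short by a factor of ~2.5e19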

In short, for the next 120 years MNIST will need mathematical tricks to be solved.

replies(1): >>42960533 #
flask_manager ◴[] No.42960533[source]
The distinction that I think is important to make when talking about "the bitter lesson" is that improving compute, training infrastructure, and tricks in the abstract wins over intelligent model and system design.

It's more that information about the specific problem you are solving has less impact than techniques that target the compute. So in this case, breaking down how to parse a PDF in stages for your domain involves specific expert knowledge of that domain, but training with attention is about efficient use of compute in general, with no domain expertise.