1303 points serjester | 9 comments

llm_trw No.42955414
This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.

You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.
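A minimal sketch of that stage, with a stub standing in for the detector (the labels, boxes, and scores here are hypothetical, not the output of any particular model):

```python
from dataclasses import dataclass

@dataclass
class Region:
    label: str    # e.g. "paragraph", "table", "figure"
    bbox: tuple   # (x0, y0, x1, y1) in pixels
    score: float  # detector confidence, 0.0-1.0

def detect_regions(page_image) -> list[Region]:
    """Stand-in for a layout-detection model trained on document data;
    real output would come from the model, not be hard-coded."""
    return [
        Region("paragraph", (50, 80, 550, 300), 0.98),
        Region("table", (50, 320, 550, 600), 0.91),
        Region("figure", (50, 620, 550, 780), 0.67),
    ]

# Keep low-confidence regions, but flag them for review downstream.
regions = detect_regions(None)
flagged = [r for r in regions if r.score < 0.8]
```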

You then feed each box of text to a regular OCR model, which also gives you a confidence score along with each prediction it makes.
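With Tesseract, for instance, `pytesseract.image_to_data` reports a per-word `conf` column. A sketch of aggregating those scores into a line-level confidence, using made-up output for one cropped region:

```python
# Hypothetical per-word OCR output for one crop, in the shape that
# pytesseract.image_to_data(..., output_type=Output.DICT) returns.
ocr_out = {
    "text": ["Total", "revenue", "was", "$1.2M"],
    "conf": [96, 91, 97, 58],  # Tesseract reports 0-100; -1 means no text
}

words = [(w, c) for w, c in zip(ocr_out["text"], ocr_out["conf"]) if c >= 0]
line_text = " ".join(w for w, _ in words)
line_conf = min(c for _, c in words) / 100.0  # the worst word bounds the line
```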

You feed each image box into a multimodal model to describe what the image is about.

For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.
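Whatever table model you pick, its output can carry a per-cell score. A sketch with hypothetical `(text, confidence)` pairs, serialized so nothing is lost:

```python
import xml.etree.ElementTree as ET

# Hypothetical table-extraction output: rows of (text, confidence) cells.
rows = [[("Q1", 0.99), ("$10M", 0.97)],
        [("Q2", 0.98), ("$1.2M", 0.61)]]

table = ET.Element("table")
for r in rows:
    tr = ET.SubElement(table, "row")
    for text, conf in r:
        cell = ET.SubElement(tr, "cell", conf=f"{conf:.2f}")
        cell.text = text

# Cells the extractor was unsure about are trivial to find later.
low_conf = [c.text for c in table.iter("cell") if float(c.get("conf")) < 0.8]
```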

You then stitch everything together in an XML file because Markdown is for human consumption.

You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.
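A sketch of that stitched output using Python's stdlib `xml.etree.ElementTree` (the element and attribute names are illustrative, not a fixed schema):

```python
import xml.etree.ElementTree as ET

page = ET.Element("page", number="1")

para = ET.SubElement(page, "paragraph",
                     bbox="50,80,550,300", det_conf="0.98", ocr_conf="0.91")
para.text = "Total revenue was $1.2M"

# Figures get a multimodal caption instead of OCR text.
fig = ET.SubElement(page, "figure", bbox="50,620,550,780", det_conf="0.67")
fig.text = "Chart of quarterly revenue"

xml_bytes = ET.tostring(page)
```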

You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.
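Deciding which parts of the document reach the LLM then becomes a plain filter over the XML attributes, along these lines:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<page>'
    '<paragraph det_conf="0.98" ocr_conf="0.95">Reliable text.</paragraph>'
    '<paragraph det_conf="0.55" ocr_conf="0.40">Garbled scan noise.</paragraph>'
    '</page>'
)

# Only send high-confidence sections to the LLM; route the rest to review.
to_llm = [el.text for el in doc.findall("paragraph")
          if float(el.get("ocr_conf", "0")) >= 0.9]
```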

You then get chunking with location data and confidence scores for every part of the document, to store as metadata in the RAG store.
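One chunk in the RAG store might then look like this (the field names are illustrative):

```python
# One RAG chunk: the text plus the provenance needed to audit or filter it.
chunk = {
    "text": "Total revenue was $1.2M",
    "metadata": {
        "page": 1,
        "bbox": [50, 80, 550, 300],
        "kind": "paragraph",
        "det_conf": 0.98,  # layout-detector score for the region
        "ocr_conf": 0.91,  # worst per-word OCR score in the chunk
    },
}

# At query time, retrieval hits can be filtered by confidence.
trusted = chunk["metadata"]["ocr_conf"] >= 0.9
```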

I've built a system that reads 500k pages _per day_ using the above, running completely locally on a machine that cost $20k.

1. siva7 No.42958781
You're describing yesterday's world. With the advancement of AI, there is no need for all these OCR steps and stages anymore. There is no need for XML in your pipeline, because Markdown is now equally suited for machine consumption by AI models.
2. llm_trw No.42959526
The results we got 18 months ago are still better than the current Gemini benchmarks, at a fraction of the cost.

As for Markdown: great. Now how do you encode the metadata about the model's confidence that the text says what it thinks it says? Because XML has this lovely thing called attributes, which lets you keep a provenance record, readable by the LLM, without a second database.
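The point in miniature: the confidence rides along as an attribute on the text itself, and flattening to Markdown throws it away (values here are made up):

```python
import xml.etree.ElementTree as ET

line = ET.fromstring('<line ocr_conf="0.58" page="3">Net loss: $1.2M</line>')
conf = float(line.get("ocr_conf"))  # provenance travels with the text

# Flattening to Markdown keeps only the text; the provenance is gone.
markdown = line.text
```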

3. JohnKemeny No.42960042
Just commenting here so that I can find my way back to this comment later. You perfectly captured the AI hype in one small paragraph.
4. fransje26 No.42960857
Hey, why settle for yesteryear's world, with better accuracy, lower costs, and local deployment, when you can use today's shiny new tool, reinvent the wheel, put everything in the cloud, and get hallucinations for free...
5. raincole No.42961253
Just commenting here to say the GP is spot on.

If you already have a highly optimized pipeline built yesterday, then sure, keep using it.

But if you're starting to deal with PDFs today, just use Gemini. Use the most human-readable formats you can find, because we know AI will be optimized for understanding those. Don't even think about "stitching XML files" and so on.

6. aiono No.42962037
Except it's more expensive, it hallucinates, and you're vendor-locked.
7. tzs No.42962399
For future reference: if you click on the timestamp of a comment, it will bring you to a screen that has a "favorite" link. Click that to add the comment to your favorite-comments list, which you can find on your profile page.
8. BenGosub No.42970939
What are the tools from yesterday's world you're referring to? I've had issues with the standard Python libraries for PDF parsing; only some state-of-the-art tools were able to parse the information correctly.
9. bitdribble No.43069374
Why do you say you're vendor-locked? There are 4-5 top-of-the-line LLMs that support structured output and compete with Gemini. Once an LLM vendor has a pipeline built for structured output, they'll pass each new model through that pipeline.