
1303 points serjester | 2 comments
kbyatnal ◴[] No.42955236[source]
It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.

I founded a doc processing company [1] and in our experience, a lot of the difficulty w/ deploying document processing into production comes when accuracy requirements are high (> 97%). This is because OCR and parsing are only one part of the problem, and real world use cases need to bridge the gap between raw outputs and production-ready data.

This requires things like the following (a rough code sketch comes after the list):

- state-of-the-art parsing powered by VLMs and OCR

- multi-step extraction powered by semantic chunking, bounding boxes, and citations

- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)

- tooling that lets nontechnical team members quickly iterate, review results, and improve accuracy

- evaluation and benchmarking tools

- fine-tuning pipelines that turn reviewed corrections into custom models
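To make the multi-step extraction point concrete, here's a rough Python sketch of the shape of the pipeline. Every name in it (Chunk, stub_vlm, extract_field) is made up for illustration, and the model call is stubbed so the snippet runs as-is; this isn't our actual API.

    from dataclasses import dataclass

    @dataclass
    class Chunk:
        text: str
        page: int
        bbox: tuple  # (x0, y0, x1, y1) in page coordinates

    def stub_vlm(prompt: str) -> str:
        # Stand-in for a real multimodal model call (e.g. Gemini 2.0).
        return "ACME Corp"

    def extract_field(chunks: list, field: str) -> dict:
        # Narrow the context to semantically relevant chunks, then ask
        # the model to extract from those chunks alone.
        candidates = [c for c in chunks if field.lower() in c.text.lower()]
        prompt = f"Extract the {field} from:\n" + "\n".join(c.text for c in candidates)
        value = stub_vlm(prompt)
        # Return citations (page + bounding box) with the value so a
        # reviewer can verify the answer against the source image.
        return {"field": field, "value": value,
                "citations": [(c.page, c.bbox) for c in candidates]}

    chunks = [Chunk("Vendor name: ACME Corp", 1, (72, 100, 400, 118))]
    print(extract_field(chunks, "vendor name"))

The point of returning citations rather than bare values is that human review (and the fine-tuning loop above) needs to trace every extracted field back to a spot on the page.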

Very excited to test and benchmark Gemini 2.0 in our product. The progress here is impressive.

[1] https://extend.app/

replies(2): >>42955931 #>>42959543 #
anon373839 ◴[] No.42955931[source]
> It's clear that OCR & document parsing are going to be swallowed up by these multimodal models.

I don’t think this is clear at all. A multimodal LLM can and will hallucinate data at arbitrary scale (phrases, sentences, etc.). Since OCR is the part of the system that extracts the “ground truth” out of your source documents, this is an unacceptable risk IMO.
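
To make the risk concrete: even cross-checking the multimodal transcript against a conventional OCR engine only catches gross divergence. A sketch (names and threshold are illustrative, not any particular product):

    from difflib import SequenceMatcher

    def needs_review(llm_text: str, ocr_text: str, threshold: float = 0.9) -> bool:
        # Low similarity between the two transcripts means at least one
        # is wrong; route the page to a human instead of trusting it.
        return SequenceMatcher(None, llm_text, ocr_text).ratio() < threshold

    same = "Total due: $1,200.00"
    print(needs_review(same, same))                    # False: identical
    print(needs_review(same, "Total due: $7,200.00"))  # False: one swapped digit
                                                       # still scores 0.95
    print(needs_review(same, "Amount payable 1200"))   # True: gross divergence

A single hallucinated digit in a dollar amount sails past the similarity check, and that's exactly the kind of error that matters.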

replies(2): >>42956985 #>>42959372 #
1. nnurmanov ◴[] No.42959372[source]
As someone mentions above, people are using a second (and even a third) LLM to correct LLM outputs; I think that's the way to minimize hallucinations.
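A minimal sketch of that checker pattern, with both model calls stubbed (in practice they'd be two different models; all names here are made up):

    def first_pass(page: str) -> str:
        return "Invoice #4471, total $982.10"  # stub for model A's extraction

    def second_pass(page: str) -> str:
        return "Invoice #4471, total $982.10"  # stub for model B's independent check

    def extract_with_check(page: str):
        a, b = first_pass(page), second_pass(page)
        # Agreement between independent passes is treated as a signal of
        # correctness; disagreement routes the page to human review.
        return a, a == b

    value, auto_accept = extract_with_check("<page image>")
    print(value, "(auto-accepted)" if auto_accept else "(needs review)")

Whether agreement actually implies correctness is the open question, since different models can share failure modes.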
replies(1): >>42960032 #
2. otabdeveloper4 ◴[] No.42960032[source]
> I think it is the way to minimize hallucinations

Or maybe the way to add new hallucinations. Nobody really knows. Just trust us bro, this is groundbreaking disruptive technology.