
1303 points | serjester | 1 comment
kbyatnal:
It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.

I founded a doc processing company [1] and in our experience, much of the difficulty in deploying document processing to production comes when accuracy requirements are high (> 97%). OCR and parsing are only one part of the problem; real-world use cases need to bridge the gap between raw model outputs and production-ready data.

This requires things like:

- state-of-the-art parsing powered by VLMs and OCR

- multi-step extraction powered by semantic chunking, bounding boxes, and citations (see the sketch after this list)

- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)

- tooling that lets nontechnical team members quickly iterate, review results, and improve accuracy

- evaluation and benchmarking tools

- fine-tuning pipelines that turn reviewed corrections into custom models
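To make the bounding-box/citation point concrete, here's a heavily stubbed sketch (illustrative only, not our actual API; all names here are made up) of why parsed chunks keep layout coordinates, so every extracted value can cite the exact page region it came from:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """A parsed text span tied to its location on the page, so extracted
    values can carry citations back to the source document."""
    text: str
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coords

def parse_pages(pages: list[str]) -> list[Chunk]:
    # Stub for step 1: a real pipeline would run a VLM/OCR engine on each
    # page image and return many spans with real coordinates. Faked here
    # (one chunk per page) to keep the sketch runnable.
    return [Chunk(text=p, page=i, bbox=(0.0, 0.0, 1.0, 1.0))
            for i, p in enumerate(pages)]

def extract_field(chunks: list[Chunk], field: str) -> dict:
    # Stub for step 2: real systems use semantic chunking plus embedding
    # retrieval and ask an LLM for the value. The key idea survives the
    # stubbing: the answer ships with a page + bounding-box citation.
    relevant = max(chunks, key=lambda c: field.lower() in c.text.lower())
    return {
        "field": field,
        "value": relevant.text,
        "citation": {"page": relevant.page, "bbox": relevant.bbox},
    }

chunks = parse_pages(["Invoice total: $1,234.56", "Terms: net 30 days"])
print(extract_field(chunks, "invoice total"))
# -> {'field': 'invoice total', 'value': 'Invoice total: $1,234.56',
#     'citation': {'page': 0, 'bbox': (0.0, 0.0, 1.0, 1.0)}}
```

In production the stubs would be a VLM parse per page image and embedding-based retrieval, but the citation plumbing is the part that matters for human review workflows.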

Very excited to test and benchmark Gemini 2.0 in our product, and more broadly about the progress here.

[1] https://extend.app/

esjeon:
I think professional services will continue to use OCR in one way or another, because it's simply so cheap, fast, and accurate. Multimodal models can perhaps help address OCR's shortcomings, like layout detection and guessing unrecognizable characters.
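For example, one way that hybrid could look (a rough sketch; pytesseract's image_to_data is a real API, the VLM call is a stub I made up):

```python
# Trust the cheap OCR engine wherever it is confident, and only send
# uncertain crops to a multimodal model.
import pytesseract
from PIL import Image

CONF_THRESHOLD = 60  # Tesseract reports per-word confidence from 0 to 100

def resolve_with_vlm(region: Image.Image) -> str:
    # Hypothetical: send the cropped region to a multimodal model
    # (e.g. Gemini) with a "transcribe this snippet" prompt.
    return "<vlm transcription>"

def hybrid_ocr(path: str) -> list[str]:
    img = Image.open(path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    words = []
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty/structural rows Tesseract emits
        if int(float(data["conf"][i])) >= CONF_THRESHOLD:
            words.append(word)  # cheap and fast: keep the OCR result
        else:
            # Low confidence: crop the word's bounding box and let the
            # multimodal model guess the unrecognizable characters.
            box = (data["left"][i], data["top"][i],
                   data["left"][i] + data["width"][i],
                   data["top"][i] + data["height"][i])
            words.append(resolve_with_vlm(img.crop(box)))
    return words
```

That keeps the bulk of the work on the fast path and only pays for the model on the hard cases.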