I founded a doc processing company [1], and in our experience a lot of the difficulty with deploying document processing into production comes when accuracy requirements are high (> 97%). That's because OCR and parsing are only one part of the problem: real-world use cases need to bridge the gap between raw outputs and production-ready data.
This requires things like:
- state-of-the-art parsing powered by VLMs and OCR
- multi-step extraction powered by semantic chunking, bounding boxes, and citations (see the first sketch after this list)
- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)
- tooling that lets nontechnical team members quickly iterate, review results, and improve accuracy
- evaluation and benchmarking tools
- fine-tuning pipelines that turn reviewed corrections into custom models (see the second sketch below)
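To make the extraction point concrete, here's a minimal sketch of what a production-ready extraction record might look like, where every value carries a citation back to its source region and a confidence score for review routing. The names and 97% threshold are illustrative, not our actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    # Normalized page coordinates in [0, 1]; a hypothetical convention
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class ExtractedField:
    # One extracted value, traceable back to where it came from
    name: str            # e.g. "invoice_total"
    value: str           # raw extracted value
    confidence: float    # model confidence, used to route to human review
    source: BoundingBox  # the citation: region of the page backing this value


@dataclass
class ExtractionResult:
    document_id: str
    fields: list[ExtractedField] = field(default_factory=list)

    def needs_review(self, threshold: float = 0.97) -> list[ExtractedField]:
        # Anything below the accuracy bar gets queued for a human reviewer
        return [f for f in self.fields if f.confidence < threshold]
```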
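And a sketch of the corrections-to-fine-tuning loop, reusing the types above. Again, the function names and the export shape of the review tool are assumptions; the real pipeline is more involved:

```python
import json


def corrections_to_training_examples(results, reviewed_values):
    """Turn reviewer-approved corrections into fine-tuning examples.

    `reviewed_values` maps (document_id, field_name) -> corrected value,
    a hypothetical shape for whatever a review tool exports.
    """
    examples = []
    for result in results:
        for f in result.fields:
            corrected = reviewed_values.get((result.document_id, f.name))
            if corrected is not None and corrected != f.value:
                # Each human correction becomes a supervised training pair
                examples.append({
                    "input": {"document_id": result.document_id, "field": f.name},
                    "label": corrected,
                })
    return examples


def write_jsonl(examples, path):
    # JSONL is a common dataset format for fine-tuning jobs
    with open(path, "w") as fh:
        for ex in examples:
            fh.write(json.dumps(ex) + "\n")
```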
Very excited to test and benchmark Gemini 2.0 in our product; great to see the progress here.