
1303 points | serjester | 1 comment
kbyatnal:
It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.

I founded a doc processing company [1] and in our experience, much of the difficulty in deploying document processing to production comes when accuracy requirements are high (> 97%). OCR and parsing are only one part of the problem; real-world use cases need to bridge the gap between raw model outputs and production-ready data.

This requires things like:

- state-of-the-art parsing powered by VLMs and OCR

- multi-step extraction powered by semantic chunking, bounding boxes, and citations (see the sketch after this list)

- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)

- tooling that lets nontechnical team members quickly iterate, review results, and improve accuracy

- evaluation and benchmarking tools

- fine-tuning pipelines that turn reviewed corrections into custom models
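To make the bounding-box/citation point concrete, here's a heavily stubbed sketch (illustrative only, not our actual API; all names here are made up) of why parsed chunks keep layout coordinates, so every extracted value can cite the exact page region it came from:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """A parsed text span tied to its location on the page, so extracted
    values can carry citations back to the source document."""
    text: str
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coords

def parse_pages(pages: list[str]) -> list[Chunk]:
    # Stub for step 1: a real pipeline would run a VLM/OCR engine on each
    # page image and return many spans with real coordinates. Faked here
    # (one chunk per page) to keep the sketch runnable.
    return [Chunk(text=p, page=i, bbox=(0.0, 0.0, 1.0, 1.0))
            for i, p in enumerate(pages)]

def extract_field(chunks: list[Chunk], field: str) -> dict:
    # Stub for step 2: real systems use semantic chunking plus embedding
    # retrieval and ask an LLM for the value. The key idea survives the
    # stubbing: the answer ships with a page + bounding-box citation.
    relevant = max(chunks, key=lambda c: field.lower() in c.text.lower())
    return {
        "field": field,
        "value": relevant.text,
        "citation": {"page": relevant.page, "bbox": relevant.bbox},
    }

chunks = parse_pages(["Invoice total: $1,234.56", "Terms: net 30 days"])
print(extract_field(chunks, "invoice total"))
# -> {'field': 'invoice total', 'value': 'Invoice total: $1,234.56',
#     'citation': {'page': 0, 'bbox': (0.0, 0.0, 1.0, 1.0)}}
```

In production the stubs would be a VLM parse per page image and embedding-based retrieval, but the citation plumbing is the part that matters for human review workflows.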

Very excited to test and benchmark Gemini 2.0 in our product, and more broadly about the progress here.

[1] https://extend.app/

esjeon:
I think professional services will continue to use OCR in one way or another, because it's simply so cheap, fast, and accurate. Multimodal models can perhaps help address OCR's shortcomings, like layout detection and guessing unrecognizable characters.
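For example, one way that hybrid could look (a rough sketch; pytesseract's image_to_data is a real API, the VLM call is a stub I made up):

```python
# Trust the cheap OCR engine wherever it is confident, and only send
# uncertain crops to a multimodal model.
import pytesseract
from PIL import Image

CONF_THRESHOLD = 60  # Tesseract reports per-word confidence from 0 to 100

def resolve_with_vlm(region: Image.Image) -> str:
    # Hypothetical: send the cropped region to a multimodal model
    # (e.g. Gemini) with a "transcribe this snippet" prompt.
    return "<vlm transcription>"

def hybrid_ocr(path: str) -> list[str]:
    img = Image.open(path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    words = []
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty/structural rows Tesseract emits
        if int(float(data["conf"][i])) >= CONF_THRESHOLD:
            words.append(word)  # cheap and fast: keep the OCR result
        else:
            # Low confidence: crop the word's bounding box and let the
            # multimodal model guess the unrecognizable characters.
            box = (data["left"][i], data["top"][i],
                   data["left"][i] + data["width"][i],
                   data["top"][i] + data["height"][i])
            words.append(resolve_with_vlm(img.crop(box)))
    return words
```

That keeps the bulk of the work on the fast path and only pays for the model on the hard cases.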