
1303 points serjester | 2 comments
kbyatnal ◴[] No.42955236[source]
It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.

I founded a doc processing company [1] and in our experience, a lot of the difficulty w/ deploying document processing into production comes when accuracy requirements are high (> 97%). This is because OCR and parsing are only one part of the problem, and real world use cases need to bridge the gap between raw outputs and production-ready data.

This requires things like the following (a rough code sketch comes after the list):

- state-of-the-art parsing powered by VLMs and OCR

- multi-step extraction powered by semantic chunking, bounding boxes, and citations

- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)

- tooling that lets nontechnical team members quickly iterate, review results, and improve accuracy

- evaluation and benchmarking tools

- fine-tuning pipelines that turn reviewed corrections into custom models
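To make the multi-step extraction point concrete, here's a rough Python sketch of the shape of the pipeline. Every name in it (Chunk, stub_vlm, extract_field) is made up for illustration, and the model call is stubbed so the snippet runs as-is; this isn't our actual API.

    from dataclasses import dataclass

    @dataclass
    class Chunk:
        text: str
        page: int
        bbox: tuple  # (x0, y0, x1, y1) in page coordinates

    def stub_vlm(prompt: str) -> str:
        # Stand-in for a real multimodal model call (e.g. Gemini 2.0).
        return "ACME Corp"

    def extract_field(chunks: list, field: str) -> dict:
        # Narrow the context to semantically relevant chunks, then ask
        # the model to extract from those chunks alone.
        candidates = [c for c in chunks if field.lower() in c.text.lower()]
        prompt = f"Extract the {field} from:\n" + "\n".join(c.text for c in candidates)
        value = stub_vlm(prompt)
        # Return citations (page + bounding box) with the value so a
        # reviewer can verify the answer against the source image.
        return {"field": field, "value": value,
                "citations": [(c.page, c.bbox) for c in candidates]}

    chunks = [Chunk("Vendor name: ACME Corp", 1, (72, 100, 400, 118))]
    print(extract_field(chunks, "vendor name"))

The point of returning citations rather than bare values is that human review (and the fine-tuning loop above) needs to trace every extracted field back to a spot on the page.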

Very excited to test and benchmark Gemini 2.0 in our product. The progress here is impressive.

[1] https://extend.app/

replies(2): >>42955931 #>>42959543 #
anon373839 ◴[] No.42955931[source]
> It's clear that OCR & document parsing are going to be swallowed up by these multimodal models.

I don’t think this is clear at all. A multimodal LLM can and will hallucinate data at arbitrary scale (phrases, sentences, etc.). Since OCR is the part of the system that extracts the “ground truth” out of your source documents, this is an unacceptable risk IMO.
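
To make the risk concrete: even cross-checking the multimodal transcript against a conventional OCR engine only catches gross divergence. A sketch (names and threshold are illustrative, not any particular product):

    from difflib import SequenceMatcher

    def needs_review(llm_text: str, ocr_text: str, threshold: float = 0.9) -> bool:
        # Low similarity between the two transcripts means at least one
        # is wrong; route the page to a human instead of trusting it.
        return SequenceMatcher(None, llm_text, ocr_text).ratio() < threshold

    same = "Total due: $1,200.00"
    print(needs_review(same, same))                    # False: identical
    print(needs_review(same, "Total due: $7,200.00"))  # False: one swapped digit
                                                       # still scores 0.95
    print(needs_review(same, "Amount payable 1200"))   # True: gross divergence

A single hallucinated digit in a dollar amount sails past the similarity check, and that's exactly the kind of error that matters.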

replies(2): >>42956985 #>>42959372 #
1. nnurmanov ◴[] No.42959372[source]
As someone mentions above, people are using a second (and even a third) LLM to correct LLM outputs; I think that's the way to minimize hallucinations.
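A minimal sketch of that checker pattern, with both model calls stubbed (in practice they'd be two different models; all names here are made up):

    def first_pass(page: str) -> str:
        return "Invoice #4471, total $982.10"  # stub for model A's extraction

    def second_pass(page: str) -> str:
        return "Invoice #4471, total $982.10"  # stub for model B's independent check

    def extract_with_check(page: str):
        a, b = first_pass(page), second_pass(page)
        # Agreement between independent passes is treated as a signal of
        # correctness; disagreement routes the page to human review.
        return a, a == b

    value, auto_accept = extract_with_check("<page image>")
    print(value, "(auto-accepted)" if auto_accept else "(needs review)")

Whether agreement actually implies correctness is the open question, since different models can share failure modes.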
replies(1): >>42960032 #
2. otabdeveloper4 ◴[] No.42960032[source]
> I think it is the way to minimize hallucinations

Or maybe the way to add new hallucinations. Nobody really knows. Just trust us bro, this is groundbreaking disruptive technology.