
58 points by prats226 | 1 comment

The OCR/document extraction field has seen a lot of action recently, with releases like Mistral OCR, Andrew Ng's agentic document processing, etc. There are also several benchmarks for OCR, but they all test for something slightly different, which makes a good comparison of models very hard.

To give an example, some models like mistral-ocr only convert a document to markdown; you have to run another LLM on top of it to get the final result. Some VLMs directly return structured information, such as key fields from invoices, but you then have to add business rules or an LLM-as-a-judge style system on top to get a sense of which outputs need manual review and which can be accepted as-is. No benchmark attempts to measure the actual rate of automation you can achieve.
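
To make the review-vs-automate decision concrete, here is a minimal sketch of confidence-threshold routing for extracted fields (the record structure, field names, and the 0.9 threshold are hypothetical illustrations, not the output format of any particular OCR/VLM product):

```python
# Minimal sketch of confidence-based routing for extracted fields.
# The record structure and the 0.9 threshold are hypothetical examples.

CONFIDENCE_THRESHOLD = 0.9  # tune per use case / acceptable error rate

def route(extracted_fields: dict[str, dict]) -> tuple[dict, dict]:
    """Split extracted fields into auto-accepted vs. needs-manual-review."""
    auto, review = {}, {}
    for name, field in extracted_fields.items():
        target = auto if field["confidence"] >= CONFIDENCE_THRESHOLD else review
        target[name] = field["value"]
    return auto, review

# Example: an invoice where one field is uncertain.
fields = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.98},
    "total_amount": {"value": "1,280.00", "confidence": 0.62},
}
accepted, needs_review = route(fields)
print(accepted)      # {'invoice_number': 'INV-1042'}
print(needs_review)  # {'total_amount': '1,280.00'}
```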

We have tried to solve this problem with a benchmark that applies only to documents/use cases where you are aiming for automation, and that tries to measure the end-to-end automation level of different models or systems.

We have collected a dataset of documents, like invoices, that show up in processes where automation is needed, as opposed to use cases that are more copilot in nature, where you would chat with the document. We have also annotated these documents and published the dataset and repo so the benchmark can be extended.

Here is the write-up: https://nanonets.com/automation-benchmark
Dataset: https://huggingface.co/datasets/nanonets/nn-auto-bench-ds
GitHub: https://github.com/NanoNets/nn-auto-bench

Looking for suggestions on how this benchmark can be improved further.

themanmaran No.43366535
Love to see another benchmark! We published the OmniAI OCR benchmark the other week. Thanks for adding us to the list.

One question on the "Automation" score in the results: is it a function of both extraction accuracy and the accuracy of the LLM's "confidence score"? I noticed the "accuracy" column was very tightly grouped (between 79% and 84%), but the automation score was much more variable.

And a side note: is there an open-source benchmark for Mistral's latest OCR model? I know they claimed it was 95% accurate, but it looks like that was based on an internal evaluation.

replies(1): >>43366764 #
1. prats226 No.43366764
Automation is a combination of both: extraction accuracy and the accuracy of the confidence scores.

A good way to think about automation is recall at high precision, which is what you need for true automation: you don't worry about the documents that are very likely to have correct results, and you focus manual correction on the documents that are likely to have errors.
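
As a rough sketch of that idea (a simplification for illustration, not necessarily the exact metric computed in nn-auto-bench): sweep a confidence threshold, keep only thresholds where accuracy on the auto-accepted documents meets a precision target, and report the largest share of documents that can be automated at such a threshold.

```python
# Sketch: "automation" as recall at high precision.
# Inputs are per-document (is_correct, confidence) pairs; both the
# precision target and the metric definition are illustrative.

def automation_rate(results: list[tuple[bool, float]], precision_target: float = 0.99) -> float:
    """Largest fraction of documents that can be auto-accepted while
    keeping accuracy on the accepted set >= precision_target."""
    best = 0.0
    for t in sorted({conf for _, conf in results}):
        accepted = [ok for ok, conf in results if conf >= t]
        if not accepted:
            continue
        precision = sum(accepted) / len(accepted)
        if precision >= precision_target:
            best = max(best, len(accepted) / len(results))
    return best

# Toy example: same raw accuracy, very different automation.
calibrated   = [(True, 0.99), (True, 0.97), (True, 0.95), (False, 0.40)]
uncalibrated = [(True, 0.99), (True, 0.55), (True, 0.50), (False, 0.60)]
print(automation_rate(calibrated))    # 0.75
print(automation_rate(uncalibrated))  # 0.25
```

In the toy example both score sets are 75% accurate, but only the calibrated one supports a high automation rate, which is exactly the gap between the accuracy and automation columns.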

The reason accuracies are tightly grouped but automation scores are not is that these models are trained to be accurate but not necessarily predictable: there is no reliable way to get calibrated confidence scores out of them.
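
To check calibration concretely, a simple binned reliability computation (a standard expected-calibration-error-style check, again just an illustration rather than something from the benchmark) compares average confidence to accuracy within each confidence bucket:

```python
# Sketch: expected calibration error over (is_correct, confidence) pairs.
# If confidence scores were well calibrated, per-bin accuracy would track
# per-bin mean confidence and the ECE would be close to zero.

def expected_calibration_error(results: list[tuple[bool, float]], n_bins: int = 10) -> float:
    total = len(results)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # include confidence == 1.0 in the last bin
        bucket = [r for r in results if lo <= r[1] < hi or (b == n_bins - 1 and r[1] == 1.0)]
        if not bucket:
            continue
        accuracy = sum(ok for ok, _ in bucket) / len(bucket)
        avg_conf = sum(conf for _, conf in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Overconfident example: every prediction reports 0.95 confidence,
# but only 6 of 10 are correct -> ECE = |0.95 - 0.60| = 0.35.
overconfident = [(True, 0.95)] * 6 + [(False, 0.95)] * 4
print(round(expected_calibration_error(overconfident), 2))  # 0.35
```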

I couldn't find the benchmark Mistral used either.