
208 points by themanmaran | 8 comments

Last week was big for open source LLMs. We got:

- Qwen 2.5 VL (72b and 32b)

- Gemma-3 (27b)

- DeepSeek-v3-0324

And a couple weeks ago we got the new mistral-ocr model. We updated our OCR benchmark to include the new models.

We evaluated 1,000 documents for JSON extraction accuracy. Major takeaways:

- Qwen 2.5 VL (72b and 32b) are by far the most impressive. Both landed right around 75% accuracy (equivalent to GPT-4o’s performance). Qwen 72b was only 0.4% above 32b, which is within the margin of error.

- Both Qwen models beat mistral-ocr (72.2%), which is specifically trained for OCR.

- Gemma-3 (27B) only scored 42.9%. That's particularly surprising given that its architecture is based on Gemini 2.0, which still tops the accuracy chart.
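Roughly speaking, a score like this counts how many ground-truth fields the model extracted correctly. A minimal sketch of that kind of scoring (hypothetical field names and a flat-dict assumption; the real scoring code is linked below):

    # Hypothetical field-level scorer, not the actual benchmark code.
    # Assumes ground truth and prediction are flat JSON objects.
    def json_accuracy(truth: dict, pred: dict) -> float:
        """Fraction of ground-truth fields the model extracted exactly."""
        if not truth:
            return 1.0
        correct = sum(1 for k, v in truth.items() if pred.get(k) == v)
        return correct / len(truth)

    # Example: 2 of 3 fields match -> ~0.667
    truth = {"invoice_no": "A-1001", "total": "42.00", "date": "2024-03-01"}
    pred = {"invoice_no": "A-1001", "total": "42.00", "date": "2024-01-03"}
    print(f"{json_accuracy(truth, pred):.3f}")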

The dataset and benchmark runner are fully open source. You can check out the code and reproduction steps here:

- https://getomni.ai/blog/benchmarking-open-source-models-for-...

- https://github.com/getomni-ai/benchmark

- https://huggingface.co/datasets/getomni-ai/ocr-benchmark

1. ks2048 (No.43551523)
I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.

For applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems unfeasible w/o bounding boxes to quickly check for errors.
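To make that review workflow concrete: with boxes, a reviewer can see each extracted string next to the exact pixels it came from. A minimal sketch, assuming Pillow and (x1, y1, x2, y2) pixel boxes:

    # Hypothetical review helper: pair each extracted string with a crop
    # of the source pixels so a human can spot-check it. Assumes Pillow
    # and (x1, y1, x2, y2) pixel boxes; without boxes there's nothing to crop.
    from PIL import Image

    def crops_for_review(page_path: str, spans: list[dict]):
        """Yield (extracted_text, cropped_image) pairs for manual checking."""
        page = Image.open(page_path)
        for span in spans:
            x1, y1, x2, y2 = span["bbox"]
            yield span["text"], page.crop((x1, y1, x2, y2))

    # spans = [{"text": "Total: $42.00", "bbox": (120, 980, 360, 1012)}, ...]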

replies(4): >>43551752, >>43551756, >>43552076, >>43552641
2. jsight (No.43551752)
I'd guess that it wouldn't be a huge effort to fine-tune them to produce bounding boxes.

I haven't done it with OCR tasks, but I have fine-tuned other models to produce them instead of merely producing descriptive text. I'm not sure if there are datasets for this already, but creating one shouldn't be very difficult.
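For illustration, one plausible shape for such a dataset, with one JSONL record per labeled page (field names are made up, not taken from any existing dataset):

    import json

    # Hypothetical JSONL training record pairing a page image with the
    # grounded transcription we'd want the model to emit. Field names
    # are illustrative, not from any existing dataset.
    record = {
        "image": "scans/invoice_0001.png",
        "target": [
            {"text": "INVOICE", "bbox": [72, 40, 210, 72]},
            {"text": "Total: $42.00", "bbox": [72, 612, 248, 640]},
        ],
    }
    with open("train.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")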

3. kapitalx (No.43551756)
If you're limited to open-source models, that's very true. But for larger models and depending on your document needs, we're definitely seeing very high accuracy (95%-99%) for direct-to-JSON extraction (no intermediate markdown step) with our solution at https://doctly.ai.
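A generic sketch of the direct-to-JSON idea (not the doctly.ai pipeline): send the page image plus the target schema in one request and parse the reply straight into JSON. The model name and schema are placeholders.

    # Generic direct-to-JSON sketch (not the doctly.ai pipeline): one
    # request with the page image plus a target schema, reply parsed
    # straight into JSON. Assumes the openai client; model and schema
    # are placeholders.
    import base64, json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    image_b64 = base64.b64encode(open("invoice.png", "rb").read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": 'Extract {"invoice_no": str, "total": str} from this page. Reply with JSON only.'},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    data = json.loads(resp.choices[0].message.content)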
replies(1): >>43551762
4. kapitalx (No.43551762)
In addition, Gemini 2.5 Pro does really well with bounding boxes, but yeah, not open source :(
5. michaelt (No.43552076)
qwen2.5-vl-72b-instruct seems perfectly happy outputting bounding boxes in my testing.

There's also a paper https://arxiv.org/pdf/2409.12191 where they explicitly say some of their training included bounding boxes and coordinates.
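A minimal sketch of asking it for boxes, assuming the model is served behind a local OpenAI-compatible endpoint (e.g. vLLM); the exact coordinate convention it returns isn't guaranteed, so treat the output format as something to verify:

    # Minimal bounding-box prompt sketch. Assumes qwen2.5-vl-72b-instruct
    # behind a local OpenAI-compatible server (e.g. vLLM); the "bbox_2d"
    # field name and coordinate convention are assumptions to verify.
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    image_b64 = base64.b64encode(open("page.png", "rb").read()).decode()

    resp = client.chat.completions.create(
        model="qwen2.5-vl-72b-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": 'OCR this page. Return JSON: [{"text": ..., "bbox_2d": [x1, y1, x2, y2]}]'},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)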

replies(1): >>43552704
6. chpatrick (No.43552641)
Actually, Qwen 2.5 is trained to provide bounding boxes.
replies(1): >>43553085
7. themanmaran (No.43552704)
We're also looking to test Qwen and others for bounding box support. Simon Willison had a great demo page where he used Gemini 2.5 to draw bounding boxes, and the results were pretty impressive. It would probably be pretty easy to drop Qwen into the same UI.

https://simonwillison.net/2025/Mar/25/gemini

8. deepsquirrelnet (No.43553085)
Yep, this is true. I was poking around on their GitHub and they have examples in their “cookbooks” section, e.g.:

https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/ocr...