←back to thread

211 points themanmaran | 1 comments | | HN request time: 0.356s | source

Last week was big for open source LLMs. We got:

- Qwen 2.5 VL (72b and 32b)

- Gemma-3 (27b)

- DeepSeek-v3-0324

And a couple weeks ago we got the new mistral-ocr model. We updated our OCR benchmark to include the new models.

We evaluated 1,000 documents for JSON extraction accuracy. Major takeaways:

- Qwen 2.5 VL (72b and 32b) are by far the most impressive. Both landed right around 75% accuracy (equivalent to GPT-4o’s performance). Qwen 72b was only 0.4% above 32b. Within the margin of error.

- Both Qwen models passed mistral-ocr (72.2%), which is specifically trained for OCR.

- Gemma-3 (27B) only scored 42.9%. Particularly surprising given that it's architecture is based on Gemini 2.0 which still tops the accuracy chart.

The data set and benchmark runner is fully open source. You can check out the code and reproduction steps here:

- https://getomni.ai/blog/benchmarking-open-source-models-for-...

- https://github.com/getomni-ai/benchmark

- https://huggingface.co/datasets/getomni-ai/ocr-benchmark

1. fpgaminer ◴[] No.43551927[source]
I've been consistently surprised by Gemini's OCR capabilities. And yeah, Qwen is climbing the vision ladder _fast_.

In my workflows I often have multiple models competing side-by-side, so I get to compare the same task executed on, say, 4o, Gemini, and Qwen. And I deal with a very wide range of vision related tasks. The newest Qwen models are not only overall better than their previous release by a good margin, but also much more stable (less prone to glitching) and easier to finetune. I'm not at all surprised they're topping the OCR benchmark.

What bugs me though is OpenAI. Outside of OCR, 4o is still king in terms of overall understanding of images. But 4o is now almost a year old, and in all that time they have neither improved the vision performance in any newer releases, nor have they improved OCR. OpenAI's OCR has been bad for a long time, and it's both odd and annoying.

Taken with a grain of salt since again I've only had it in my workflow for about a week or two, but I'd say Qwen 2.5 VL 72b beats Gemini for general vision. That lands it in second place for me. And it can be run _locally_. That's nuts. I'm going to laugh if Qwen drops another iteration in a couple months that beats 4o.