
208 points | themanmaran | 1 comment

Last week was big for open source LLMs. We got:

- Qwen 2.5 VL (72b and 32b)

- Gemma-3 (27b)

- DeepSeek-v3-0324

And a couple weeks ago we got the new mistral-ocr model. We updated our OCR benchmark to include the new models.

We evaluated 1,000 documents for JSON extraction accuracy. Major takeaways:

- Qwen 2.5 VL (72b and 32b) are by far the most impressive. Both landed right around 75% accuracy (equivalent to GPT-4o’s performance). Qwen 72b was only 0.4% above 32b, which is well within the margin of error (quick sanity check after this list).

- Both Qwen models surpassed mistral-ocr (72.2%), which is trained specifically for OCR.

- Gemma-3 (27b) scored only 42.9%. That's particularly surprising given that its architecture is based on Gemini 2.0, which still tops the accuracy chart.
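
A quick back-of-the-envelope check on that margin-of-error claim (my own arithmetic, not from the post): with 1,000 documents and roughly 75% accuracy, the binomial standard error is about 1.4%, so a 0.4% gap between the two Qwen models sits well inside the noise.

```python
import math

# Binomial standard error for an accuracy estimate: SE = sqrt(p * (1 - p) / n)
p, n = 0.75, 1000  # ~75% accuracy measured over 1,000 documents
se = math.sqrt(p * (1 - p) / n)
print(f"standard error:    {se:.2%}")         # ~1.37%
print(f"95% CI half-width: {1.96 * se:.2%}")  # ~2.68%
# A 0.4% difference between Qwen 72b and 32b is well within this band.
```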

The dataset and benchmark runner are fully open source. You can check out the code and reproduction steps here:

- https://getomni.ai/blog/benchmarking-open-source-models-for-...

- https://github.com/getomni-ai/benchmark

- https://huggingface.co/datasets/getomni-ai/ocr-benchmark
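
For a sense of what "JSON extraction accuracy" means here, below is a minimal sketch of field-level scoring. The function is my own illustration, not the benchmark's actual implementation (see the repo above for that), and it ignores lists and fuzzy matching for brevity:

```python
def json_accuracy(expected: dict, actual: dict) -> float:
    """Fraction of expected leaf fields the model got exactly right.
    A simplified stand-in for the benchmark's real scoring logic."""
    def leaves(obj, prefix=""):
        # Flatten nested dicts into (dotted_path, value) pairs.
        if isinstance(obj, dict):
            for k, v in obj.items():
                yield from leaves(v, f"{prefix}{k}.")
        else:
            yield prefix.rstrip("."), obj

    expected_leaves = dict(leaves(expected))
    actual_leaves = dict(leaves(actual))
    if not expected_leaves:
        return 1.0
    correct = sum(
        1 for path, value in expected_leaves.items()
        if actual_leaves.get(path) == value
    )
    return correct / len(expected_leaves)

# Example: two of three fields match => ~0.67
truth = {"invoice": {"number": "INV-42", "total": "118.00"}, "date": "2024-03-01"}
model = {"invoice": {"number": "INV-42", "total": "118.00"}, "date": "2024-01-03"}
print(json_accuracy(truth, model))  # 0.666...
```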

daemonologist No.43550948
You mention that you measured cost and latency in addition to accuracy - would you be willing to share those results as well? (I understand that for these open models they would vary between providers, but it would be useful to have an approximate baseline.)
replies(1): >>43551259 #
themanmaran No.43551259
Yes, I'll add that to the writeup! You're right, I initially excluded it because it was heavily dependent on the provider, so there was a lot of variance, especially with the Qwen models.

High level results were:

- Qwen 32b => $0.33/1000 pages => 53s/page

- Qwen 72b => $0.71/1000 pages => 51s/page

- Llama 90b => $8.50/1000 pages => 44s/page

- Llama 11b => $0.21/1000 pages => 08s/page

- Gemma 27b => $0.25/1000 pages => 22s/page

- Mistral => $1.00/1000 pages => 03s/page
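
To sanity-check numbers like these, per-page cost falls out of per-token pricing roughly as below. Every figure in the sketch (token counts per page, $/1M-token rates) is an illustrative assumption, not a number from the benchmark:

```python
# Rough per-page cost from per-token provider pricing.
# All numbers here are illustrative assumptions.
input_tokens_per_page = 1500    # image/prompt tokens for one scanned page
output_tokens_per_page = 700    # extracted JSON
input_price_per_m = 0.20        # $ per 1M input tokens (hypothetical rate)
output_price_per_m = 0.20       # $ per 1M output tokens (hypothetical rate)

cost_per_page = (
    input_tokens_per_page * input_price_per_m
    + output_tokens_per_page * output_price_per_m
) / 1_000_000
print(f"${cost_per_page * 1000:.2f} per 1,000 pages")  # $0.44 with these assumptions
```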

replies(2): >>43551589 #>>43551686 #
dylan604 No.43551589
One of these things is not like the others. $8.50/1000?? Any chance that's a typo? Otherwise, for someone who has no experience with LLM pricing models, why is Llama 90b so expensive?
replies(2): >>43551806 #>>43552161 #
int_19h No.43552161
It's not uncommon to see outliers like this when using brokers. Basically, some models are very popular and have many different providers, so they're priced "close to the metal", since routing will normally pick the cheapest option that meets the specified requirements (like context size). Other models - typically more specialized ones - are hosted by only a single provider, and that provider can price them much higher than raw compute cost.

E.g. if you look at https://openrouter.ai/models?order=pricing-high-to-low, you'll see that there are some 7B and 8B models that are more expensive than Claude Sonnet 3.7.
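
Conceptually, that routing boils down to something like the sketch below: filter providers by the hard requirements, then take the cheapest survivor. This is my own illustration, not OpenRouter's actual implementation, and the provider names and prices are made up:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    price_per_m_tokens: float  # $ per 1M tokens
    max_context: int           # tokens

def pick_provider(providers: list[Provider], min_context: int) -> Provider:
    """Cheapest provider that satisfies the context requirement."""
    eligible = [p for p in providers if p.max_context >= min_context]
    if not eligible:
        raise ValueError("no provider meets the requirements")
    return min(eligible, key=lambda p: p.price_per_m_tokens)

# With many hosts, competition drives the winning price down;
# a single-host model keeps whatever price that one host sets.
hosts = [
    Provider("host-a", 0.30, 32_000),
    Provider("host-b", 0.22, 128_000),
    Provider("host-c", 0.25, 128_000),
]
print(pick_provider(hosts, min_context=64_000).name)  # host-b
```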

replies(1): >>43556385 #
nickpsecurity No.43556385
I'll add that some big-name suppliers with big models might be running at or near a loss on purpose to draw in customers. That behavior is often encouraged by funders who gave them over $100 million to capture the market.

Their theory is that they can raise prices once their competitors go out of business. The companies open-sourcing pretrained models are countering that. So we see a mix of huge models underpriced by scheming companies and open-source models priced for inference on free-market principles.