Not really this application, but QvQ for visual reasoning is also impressive. https://qwenlm.github.io/blog/qvq-72b-preview/
Meta has used Qwen as the basis for their Apollo research. https://arxiv.org/abs/2412.10360
We’ve locally tested with Llama 3.2 11B Vision on Ollama: https://github.com/vlm-run/vlmrun-hub/blob/main/tests/benchm...
FWIW, I think the Ollama structured outputs API is quite buggy compared to the HF Transformers variant.
So you end up hitting roadblocks for seemingly simple Pydantic schemas.
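To make the kind of roadblock concrete, here's a minimal sketch of the pattern in question, assuming the Ollama Python client and a locally pulled llama3.2-vision model; the Invoice schema and invoice.jpg path are made up for illustration:

```python
from ollama import chat
from pydantic import BaseModel

# Hypothetical schema for illustration; even modest nesting like this
# (lists of submodels) is where the "seemingly simple" roadblocks show up.
class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    vendor: str
    total: float
    items: list[LineItem]

# Ollama's structured outputs take a JSON schema via `format=`;
# assumes llama3.2-vision:11b has already been pulled locally.
response = chat(
    model="llama3.2-vision:11b",
    messages=[{
        "role": "user",
        "content": "Extract the invoice details from this image.",
        "images": ["invoice.jpg"],  # placeholder path
    }],
    format=Invoice.model_json_schema(),
)

invoice = Invoice.model_validate_json(response.message.content)
print(invoice)
```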
But they seem to be considered disparate concepts. So I'm trying to understand if there's some additional nuance I'm missing.
A few video schemas are already added to the main catalog: https://github.com/vlm-run/vlmrun-hub/blob/main/vlmrun/hub/c...
You can see some of the qualitative results for GPT-4o, Gemini, Llama 3.2 11B, and Phi-4 here: https://github.com/vlm-run/vlmrun-hub?tab=readme-ov-file#-qu...
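For reference, the video schemas in the catalog are Pydantic models like everything else in the hub; the sketch below is only a hypothetical example of the general shape (the real schemas are in the repo linked above):

```python
from pydantic import BaseModel, Field

# Hypothetical video schema, for illustration only; see the vlmrun-hub
# catalog for the actual definitions.
class SceneSegment(BaseModel):
    start_time: float = Field(description="Segment start, in seconds")
    end_time: float = Field(description="Segment end, in seconds")
    description: str = Field(description="What happens in this segment")

class VideoDescription(BaseModel):
    title: str
    segments: list[SceneSegment]
```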
I've generally found json-mode to be more useful than function-calling, even though the latter is what everyone fixates on because of its obvious use in agents.
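For anyone who hasn't compared the two, here's a rough sketch of the difference using the OpenAI Python SDK (the model name and tool name are just placeholders):

```python
import json
from openai import OpenAI

client = OpenAI()

# json-mode: the model replies with a single JSON object that you parse yourself.
# (The prompt must mention JSON when using response_format={"type": "json_object"}.)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[{"role": "user",
               "content": "Return the city and country of the Eiffel Tower as JSON."}],
)
data = json.loads(resp.choices[0].message.content)

# function-calling: the model chooses a tool and fills in its arguments instead.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    tools=[{
        "type": "function",
        "function": {
            "name": "lookup_landmark",  # hypothetical tool name
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"},
                               "country": {"type": "string"}},
                "required": ["city", "country"],
            },
        },
    }],
    messages=[{"role": "user", "content": "Where is the Eiffel Tower?"}],
)
args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
```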
If you haven’t heard of us, we provide a language and runtime that let you define your schemas in a simpler syntax and use them with _any_ model, not just those that implement tool calling or JSON mode, by relying on schema-aligned parsing. Check it out! https://github.com/BoundaryML/baml
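Roughly, schema-aligned parsing means recovering a typed object from loosely formatted model output instead of demanding strict JSON. The snippet below is only a toy Python sketch of that idea, not our actual implementation:

```python
import json
import re
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

def schema_aligned_parse(raw: str, schema: type[BaseModel]) -> BaseModel:
    """Toy illustration: align loosely formatted output to a schema."""
    # 1. Pull out the JSON-ish region even if it is wrapped in prose or ``` fences.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no object-like region found")
    text = match.group(0)
    # 2. Tolerate common model quirks (only trailing commas here; the real thing handles far more).
    text = re.sub(r",\s*([}\]])", r"\1", text)
    data = json.loads(text)
    # 3. Keep only fields the schema knows about, then let Pydantic coerce types
    #    (e.g. "36" -> 36 for the int field).
    data = {k: v for k, v in data.items() if k in schema.model_fields}
    return schema.model_validate(data)

raw_output = 'Sure! Here you go:\n```json\n{"name": "Ada", "age": "36", "note": "extra",}\n```'
print(schema_aligned_parse(raw_output, Person))  # name='Ada' age=36
```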
git config --global init.defaultBranch master
There's an equivalent setting in GitHub.
What’s the use-case and what kind of latency do you require?