Not really this application, but QvQ for visual reasoning is also impressive. https://qwenlm.github.io/blog/qvq-72b-preview/
Meta has used Qwen as the basis for their Apollo research. https://arxiv.org/abs/2412.10360
We’ve locally tested with Llama 3.2 11B Vision on Ollama: https://github.com/vlm-run/vlmrun-hub/blob/main/tests/benchm...
FWIW, I think the Ollama structured outputs API is quite buggy compared to the HF Transformers variant.
So you end up hitting roadblocks for seemingly simple Pydantic schemas.
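To make the kind of roadblock concrete, here's a minimal sketch of the pattern in question, assuming the Ollama Python client and a locally pulled llama3.2-vision model; the Invoice schema and invoice.jpg path are made up for illustration:

```python
from ollama import chat
from pydantic import BaseModel

# Hypothetical schema for illustration; even modest nesting like this
# (lists of submodels) is where the "seemingly simple" roadblocks show up.
class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    vendor: str
    total: float
    items: list[LineItem]

# Ollama's structured outputs take a JSON schema via `format=`;
# assumes llama3.2-vision:11b has already been pulled locally.
response = chat(
    model="llama3.2-vision:11b",
    messages=[{
        "role": "user",
        "content": "Extract the invoice details from this image.",
        "images": ["invoice.jpg"],  # placeholder path
    }],
    format=Invoice.model_json_schema(),
)

invoice = Invoice.model_validate_json(response.message.content)
print(invoice)
```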
But they seem to be considered disparate concepts. So I'm trying to understand if there's some additional nuance I'm missing.
A few video schemas are already added to the main catalog: https://github.com/vlm-run/vlmrun-hub/blob/main/vlmrun/hub/c...
You can see some of the qualitative results for GPT-4o, Gemini, Llama 3.2 11B, and Phi-4 here: https://github.com/vlm-run/vlmrun-hub?tab=readme-ov-file#-qu...
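For reference, the video schemas in the catalog are Pydantic models like everything else in the hub; the sketch below is only a hypothetical example of the general shape (the real schemas are in the repo linked above):

```python
from pydantic import BaseModel, Field

# Hypothetical video schema, for illustration only; see the vlmrun-hub
# catalog for the actual definitions.
class SceneSegment(BaseModel):
    start_time: float = Field(description="Segment start, in seconds")
    end_time: float = Field(description="Segment end, in seconds")
    description: str = Field(description="What happens in this segment")

class VideoDescription(BaseModel):
    title: str
    segments: list[SceneSegment]
```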
I've generally found json-mode to be more useful than function-calling, even though the latter is what everyone fixates on because of its obvious use in agents.
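For anyone who hasn't compared the two, here's a rough sketch of the difference using the OpenAI Python SDK (the model name and tool name are just placeholders):

```python
import json
from openai import OpenAI

client = OpenAI()

# json-mode: the model replies with a single JSON object that you parse yourself.
# (The prompt must mention JSON when using response_format={"type": "json_object"}.)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[{"role": "user",
               "content": "Return the city and country of the Eiffel Tower as JSON."}],
)
data = json.loads(resp.choices[0].message.content)

# function-calling: the model chooses a tool and fills in its arguments instead.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    tools=[{
        "type": "function",
        "function": {
            "name": "lookup_landmark",  # hypothetical tool name
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"},
                               "country": {"type": "string"}},
                "required": ["city", "country"],
            },
        },
    }],
    messages=[{"role": "user", "content": "Where is the Eiffel Tower?"}],
)
args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
```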
If you haven’t heard of us, we provide a language and runtime that let you define your schemas in a simpler syntax and use them with _any_ model, not just those that implement tool calling or JSON mode, by relying on schema-aligned parsing. Check it out! https://github.com/BoundaryML/baml
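Roughly, schema-aligned parsing means recovering a typed object from loosely formatted model output instead of demanding strict JSON. The snippet below is only a toy Python sketch of that idea, not our actual implementation:

```python
import json
import re
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

def schema_aligned_parse(raw: str, schema: type[BaseModel]) -> BaseModel:
    """Toy illustration: align loosely formatted output to a schema."""
    # 1. Pull out the JSON-ish region even if it is wrapped in prose or ``` fences.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        raise ValueError("no object-like region found")
    text = match.group(0)
    # 2. Tolerate common model quirks (only trailing commas here; the real thing handles far more).
    text = re.sub(r",\s*([}\]])", r"\1", text)
    data = json.loads(text)
    # 3. Keep only fields the schema knows about, then let Pydantic coerce types
    #    (e.g. "36" -> 36 for the int field).
    data = {k: v for k, v in data.items() if k in schema.model_fields}
    return schema.model_validate(data)

raw_output = 'Sure! Here you go:\n```json\n{"name": "Ada", "age": "36", "note": "extra",}\n```'
print(schema_aligned_parse(raw_output, Person))  # name='Ada' age=36
```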
git config --global init.defaultBranch master
There's an equivalent setting in GitHub.
What’s the use-case and what kind of latency do you require?