
169 points by Tammilore | 8 comments

Documind is an open-source tool that turns documents into structured data using AI.

What it does:

- Extracts specific data from PDFs based on your custom schema
- Returns clean, structured JSON that's ready to use
- Works with just a PDF link + your schema definition

Just run npm install documind to get started.
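For illustration, a custom schema of the kind described above might look like the following. The field names and the exact schema format here are assumptions for the sake of the example, not documind's actual API; check the project's README for the real shape.

```javascript
// Hypothetical schema describing the fields to pull out of an invoice PDF.
const invoiceSchema = {
  invoiceNumber: { type: "string", description: "The invoice number" },
  totalAmount: { type: "number", description: "Total amount due" },
  lineItems: { type: "array", description: "Individual line items" },
};

// The promise of a tool like this is a clean JSON result matching the
// schema, e.g. (illustrative values only):
const exampleResult = {
  invoiceNumber: "INV-2024-001",
  totalAmount: 1234.5,
  lineItems: [{ description: "Consulting", amount: 1234.5 }],
};
```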

1. infecto ◴[] No.42174595[source]
Multimodal LLMs are not the way to do this for a business workflow yet.

In my experience you're much better off starting with Azure Doc Intelligence or AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at them. From there you can use an LLM to interrogate and structure the data to your heart's delight.

replies(2): >>42176035 #>>42176122 #
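A rough sketch of the first half of that pipeline: Textract's AnalyzeDocument returns a flat `Blocks` array (TABLE, CELL, and WORD blocks, with CELL blocks carrying RowIndex/ColumnIndex and CHILD relationships to their words), which can be reassembled into a grid before handing it to an LLM. The miniature sample response below is made up.

```javascript
// Rebuild a 2D grid from Textract-style Blocks: each CELL names its
// row/column position and points at its WORD children by Id.
function tableToGrid(blocks) {
  const byId = new Map(blocks.map((b) => [b.Id, b]));
  const cells = blocks.filter((b) => b.BlockType === "CELL");
  const rows = Math.max(...cells.map((c) => c.RowIndex), 0);
  const cols = Math.max(...cells.map((c) => c.ColumnIndex), 0);
  const grid = Array.from({ length: rows }, () => Array(cols).fill(""));
  for (const cell of cells) {
    const words = (cell.Relationships ?? [])
      .filter((r) => r.Type === "CHILD")
      .flatMap((r) => r.Ids)
      .map((id) => byId.get(id)?.Text ?? "");
    grid[cell.RowIndex - 1][cell.ColumnIndex - 1] = words.join(" ").trim();
  }
  return grid;
}

// Made-up miniature response: a 1x2 table containing "Qty" and "Weight".
const blocks = [
  { Id: "w1", BlockType: "WORD", Text: "Qty" },
  { Id: "w2", BlockType: "WORD", Text: "Weight" },
  { Id: "c1", BlockType: "CELL", RowIndex: 1, ColumnIndex: 1,
    Relationships: [{ Type: "CHILD", Ids: ["w1"] }] },
  { Id: "c2", BlockType: "CELL", RowIndex: 1, ColumnIndex: 2,
    Relationships: [{ Type: "CHILD", Ids: ["w2"] }] },
];
```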
2. IndieCoder ◴[] No.42176035[source]
Plus one, I'm using the exact same setup to make it scale. If Azure Doc Intelligence gets too expensive, VLMs also work great
replies(1): >>42177063 #
3. disgruntledphd2 ◴[] No.42176122[source]
> AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at them.

Do they work for Bills of Lading yet? When I tested a sample of these bills a few years back (2022, I think), the results were not good at all. But I honestly wouldn't be surprised if they'd massively improved lately.

replies(1): >>42178301 #
4. vinothgopi ◴[] No.42177063[source]
What is a VLM?
replies(1): >>42177860 #
5. saharhash ◴[] No.42177860{3}[source]
Vision Language Model, like Qwen VL https://github.com/QwenLM/Qwen2-VL or ColPali https://huggingface.co/blog/manu/colpali
replies(1): >>42195886 #
6. infecto ◴[] No.42178301[source]
Have not used it on your docs, but I can say that it definitely works well with forms, including forms with tables like a Bill of Lading. It costs extra, but you need to turn on table extraction (at least in AWS). You can then get a markdown representation of that page, including tables. You can of course pull out the table itself, but unless it's standardized you will need the middleman LLM to figure out the exact data/structure you are looking for.
replies(1): >>42193210 #
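The markdown representation mentioned above can be sketched as a small helper over an extracted grid (this is an assumed rendering, not Textract's own output format):

```javascript
// Render an extracted grid (first row = header) as a markdown table,
// ready to paste into an LLM prompt for the structuring step.
function gridToMarkdown(grid) {
  const [header, ...rows] = grid;
  const line = (cells) => `| ${cells.join(" | ")} |`;
  const sep = line(header.map(() => "---"));
  return [line(header), sep, ...rows.map(line)].join("\n");
}
```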
7. disgruntledphd2 ◴[] No.42193210{3}[source]
Huh, interesting. I'll have to try again next time I need to parse stuff like this.
8. sidmo ◴[] No.42195886{4}[source]
VLMs are cool - they generate embeddings of the images themselves (as a collection of patches) and you can see query matching displayed as a heatmap over the document. Picks up text that OCR misses. Here's an open-source API demo I built if you want to try it out: https://github.com/DataFog/vlm-api
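The query matching described here is late interaction (MaxSim), the scoring ColPali-style retrievers use: each query-token embedding takes its best dot product over the page's patch embeddings, and those per-token maxima are summed (the per-patch maxima are what get rendered as a heatmap). A toy sketch with tiny made-up vectors:

```javascript
// Late-interaction (MaxSim) scoring: sum over query tokens of the best
// dot product against any image patch embedding.
const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

function maxSimScore(queryTokens, patchEmbeddings) {
  return queryTokens.reduce(
    (score, q) => score + Math.max(...patchEmbeddings.map((p) => dot(q, p))),
    0
  );
}
```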