
169 points by Tammilore | 2 comments

Documind is an open-source tool that turns documents into structured data using AI.

What it does:

- Extracts specific data from PDFs based on your custom schema
- Returns clean, structured JSON that's ready to use
- Works with just a PDF link + your schema definition

Just run npm install documind to get started.
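To make the "schema in, JSON out" idea concrete, here is a sketch of what a custom schema and the resulting structured output might look like. The field names and schema shape are hypothetical, purely for illustration; consult the Documind README for the library's actual API.

```javascript
// Hypothetical invoice-extraction schema -- illustrative only, not
// Documind's actual schema format.
const invoiceSchema = {
  invoiceNumber: { type: "string" },
  issueDate: { type: "string" },
  totalAmount: { type: "number" },
  lineItems: {
    type: "array",
    items: { description: { type: "string" }, amount: { type: "number" } },
  },
};

// Given a PDF link plus that schema, the tool's promise is clean JSON
// conforming to it, along these lines:
const exampleResult = {
  invoiceNumber: "INV-1042",
  issueDate: "2024-05-01",
  totalAmount: 129.5,
  lineItems: [{ description: "Consulting", amount: 129.5 }],
};

console.log(JSON.stringify(exampleResult, null, 2));
```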

infecto No.42174595
Multimodal LLMs are not the way to do this for a business workflow yet.

In my experience you're much better off starting with Azure Document Intelligence or AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at them. From there you can use an LLM to interrogate and structure the data to your heart's delight.
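The first half of that two-stage pipeline can be sketched as follows: Textract's text-detection APIs return a list of `Blocks`, and pulling out the `LINE` blocks gives you clean text to hand to an LLM for structuring. The response object here is a small mock, assuming the documented `Blocks`/`BlockType`/`Text` shape, so no AWS credentials are needed to follow along.

```javascript
// Collect plain text lines from an AWS Textract-style response; the joined
// text would then go into an LLM prompt asking for JSON matching a schema.
function extractLines(response) {
  return (response.Blocks || [])
    .filter((b) => b.BlockType === "LINE" && b.Text)
    .map((b) => b.Text);
}

// Mock stand-in for a Textract DetectDocumentText response.
const mockResponse = {
  Blocks: [
    { BlockType: "PAGE" },
    { BlockType: "LINE", Text: "Invoice INV-1042" },
    { BlockType: "LINE", Text: "Total: $129.50" },
    { BlockType: "WORD", Text: "Invoice" }, // WORD blocks duplicate LINE text
  ],
};

const lines = extractLines(mockResponse);
console.log(lines.join("\n"));
// Invoice INV-1042
// Total: $129.50
```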

IndieCoder No.42176035
Plus one; I'm using that exact setup at scale. If Azure Doc Intelligence gets too expensive, VLMs also work great.
vinothgopi No.42177063
What is a VLM?
saharhash No.42177860
Vision Language Model, like Qwen2-VL https://github.com/QwenLM/Qwen2-VL or ColPali https://huggingface.co/blog/manu/colpali
sidmo No.42195886
VLMs are cool - they generate embeddings of the document images themselves (as a collection of patches), and you can see query matching displayed as a heatmap over the document. They pick up text that OCR misses. Here's an open-source API demo I built if you want to try it out: https://github.com/DataFog/vlm-api
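The query-to-patch matching described above can be sketched with the late-interaction ("MaxSim") scoring that ColPali-style retrievers use: each query-token embedding is matched against its best image-patch embedding, and the per-token maxima are summed. The tiny 2-D embeddings below are toy values for illustration; real models use high-dimensional vectors and hundreds of patches.

```javascript
// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// MaxSim late-interaction score: for each query-token embedding, take its
// best match over all patch embeddings, then sum across query tokens.
function maxSimScore(queryEmbeddings, patchEmbeddings) {
  return queryEmbeddings.reduce(
    (score, q) => score + Math.max(...patchEmbeddings.map((p) => dot(q, p))),
    0
  );
}

// Per-patch similarities for one query token -- this is the quantity that
// gets rendered as a heatmap over the document image.
function heatmap(q, patchEmbeddings) {
  return patchEmbeddings.map((p) => dot(q, p));
}

const query = [[1, 0], [0, 1]];           // two query-token embeddings (toy)
const patches = [[0.9, 0.1], [0.2, 0.8]]; // two image-patch embeddings (toy)

console.log(maxSimScore(query, patches).toFixed(2)); // 0.9 + 0.8 -> "1.70"
console.log(heatmap(query[0], patches));             // [0.9, 0.2]
```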