(github.com)

169 points Tammilore | 1 comment | 18 Nov 24 10:51 UTC

Documind is an open-source tool that turns documents into structured data using AI.

What it does:

- Extracts specific data from PDFs based on your custom schema
- Returns clean, structured JSON that's ready to use
- Works with just a PDF link + your schema definition

Just run npm install documind to get started.
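
To make "custom schema in, structured JSON out" concrete, here is a minimal, library-agnostic sketch in Python. It is not Documind's API and does not use an LLM: the schema format, the `extract_fields` helper, and the sample text are all hypothetical, and it assumes the PDF text has already been pulled out by some other tool.

```python
import json
import re

# Hypothetical schema format (NOT Documind's): each field has a target type
# and a regex whose first capture group holds the value.
SCHEMA = {
    "invoice_number": {"type": str, "pattern": r"Invoice\s+No\.?\s*:\s*(\S+)"},
    "total": {"type": float, "pattern": r"Total\s*:\s*\$?([\d,]+\.\d{2})"},
}


def extract_fields(text, schema):
    """Return one key per schema field; fields that are not found map to None."""
    result = {}
    for name, spec in schema.items():
        match = re.search(spec["pattern"], text, flags=re.IGNORECASE)
        if match is None:
            result[name] = None
            continue
        raw = match.group(1)
        if spec["type"] is float:
            # Drop thousands separators so float() can parse the number.
            raw = raw.replace(",", "")
        result[name] = spec["type"](raw)
    return result


# Stand-in for text already extracted from a PDF (assumed, not real data).
sample_text = "Invoice No: INV-0042\nDate: 2024-11-18\nTotal: $1,250.00\n"

record = extract_fields(sample_text, SCHEMA)
print(json.dumps(record, indent=2))
```

The point of the schema is that the caller decides which fields come back and in what type, so the output can be consumed directly as JSON. An LLM-backed tool would presumably replace the regex step with a model call, but the input/output contract would look similar.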

vr46 ◴[18 Nov 24 19:20 UTC] No.42175881[source]▶

>>42171311 (OP) #

I’ll have to test this against my local Python pipeline, which does all this without an LLM in attendance. There are a ton of existing Python libraries that have been doing this for a long time, so let’s take a look.

replies(1): >>42176786 #

thegabriele ◴[18 Nov 24 20:44 UTC] No.42176786[source]▶

>>42175881 #

Care to share the best ones for some use cases? Thanks

replies(1): >>42177301 #

1. vr46 ◴[18 Nov 24 21:31 UTC] No.42177301[source]▶

>>42176786 #

MinerU

PDFQuery

PyMuPDF (having more success with older versions, right now)

Show HN: Documind – Open-source AI tool to turn documents into structured data