Show HN: Documind – Open-source AI tool to turn documents into structured data

(github.com)

169 points Tammilore | 2 comments | 18 Nov 24 10:51 UTC

Documind is an open-source tool that turns documents into structured data using AI.

What it does:

- Extracts specific data from PDFs based on your custom schema

- Returns clean, structured JSON that's ready to use

- Works with just a PDF link + your schema definition

Just run npm install documind to get started.
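To make "custom schema in, structured JSON out" concrete, here is a minimal sketch of checking an extracted object against a schema. The `SchemaField` shape, the `validateAgainstSchema` helper, and the invoice field names are all hypothetical illustrations, not Documind's actual API.

```typescript
// Hypothetical schema shape for illustration only; NOT Documind's real API.
type FieldType = "string" | "number" | "boolean";

interface SchemaField {
  name: string;
  type: FieldType;
  required?: boolean;
}

// Returns a list of human-readable problems; an empty list means the
// object conforms to the schema.
function validateAgainstSchema(
  schema: SchemaField[],
  data: Record<string, unknown>
): string[] {
  const problems: string[] = [];
  for (const field of schema) {
    const value = data[field.name];
    if (value === undefined || value === null) {
      if (field.required) problems.push(`missing required field: ${field.name}`);
      continue;
    }
    if (typeof value !== field.type) {
      problems.push(`field ${field.name}: expected ${field.type}, got ${typeof value}`);
    }
  }
  return problems;
}

// Assumed example schema and an assumed extraction result.
const invoiceSchema: SchemaField[] = [
  { name: "invoiceNumber", type: "string", required: true },
  { name: "total", type: "number", required: true },
  { name: "paid", type: "boolean" },
];

const extracted: Record<string, unknown> = { invoiceNumber: "INV-001", total: "120.50" };

// Flags that `total` came back as a string rather than a number.
console.log(validateAgainstSchema(invoiceSchema, extracted));
```

A check like this is cheap to run on every result, which matters when the extraction step is a model that can return the right keys with the wrong types.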

emmanueloga_ ◴[18 Nov 24 16:16 UTC] No.42173837[source]▶

>>42171311 (OP) #

From the source, Documind appears to:

1) Install tools like Ghostscript, GraphicsMagick, and LibreOffice with a JS script.

2) Convert document pages to Base64 PNGs and send them to OpenAI for data extraction.

3) Use Supabase for unclear reasons.
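As a rough sketch of what step 2 could look like (not code from Documind or zerox), the snippet below Base64-encodes page images and builds a request body in the message format the OpenAI Chat Completions API accepts for image input. The function names, the instruction text, and the model name are assumptions, and nothing is actually sent over the network.

```typescript
const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Plain Base64 encoder so the sketch has no runtime dependencies.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i];
    const b1 = i + 1 < bytes.length ? bytes[i + 1] : 0;
    const b2 = i + 2 < bytes.length ? bytes[i + 2] : 0;
    out += B64_ALPHABET[b0 >> 2];
    out += B64_ALPHABET[((b0 & 3) << 4) | (b1 >> 4)];
    // Pad with "=" when the final group has fewer than three bytes.
    out += i + 1 < bytes.length ? B64_ALPHABET[((b1 & 15) << 2) | (b2 >> 6)] : "=";
    out += i + 2 < bytes.length ? B64_ALPHABET[b2 & 63] : "=";
  }
  return out;
}

interface TextPart { type: "text"; text: string }
interface ImagePart { type: "image_url"; image_url: { url: string } }

// Hypothetical helper: one text instruction followed by one image part per page.
function buildExtractionRequest(pagePngs: Uint8Array[], instructions: string, model: string) {
  const content: Array<TextPart | ImagePart> = [{ type: "text", text: instructions }];
  for (const png of pagePngs) {
    content.push({
      type: "image_url",
      image_url: { url: `data:image/png;base64,${toBase64(png)}` },
    });
  }
  return { model, messages: [{ role: "user", content }] };
}

// Placeholder bytes (just the 8-byte PNG signature); a real page would be a full PNG.
const fakePage = new Uint8Array([137, 80, 78, 71, 13, 10, 26, 10]);
const request = buildExtractionRequest(
  [fakePage],
  "Extract the invoice fields as JSON.", // assumed prompt
  "gpt-4o" // assumed model name
);
console.log(JSON.stringify(request, null, 2));
```

Every page image travels to the API inside the request body, which is why the retention and cost concerns in this thread scale with document length.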

Some issues with this approach:

* OpenAI may retain and use your data for training, raising privacy concerns [1].

* Dependencies should be managed with Docker or package managers like Nix or Pixi, which are more robust. Example: a tool like Parsr [2] provides a Dockerized pdf-to-json solution, complete with OCR support and an HTTP api.

* GPT-4 vision seems costly, error-prone, and unreliable, and not really suited to extracting data from sensitive docs like invoices without review.

* Traditional methods (PDF parsers with OCR support) are cheaper, more reliable, and avoid retention risks for this particular use case. Although these tools do require some plumbing... probably LLMs can really help with that!
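The "plumbing" for the traditional route might look something like the sketch below: once a PDF parser or OCR step has produced plain text, a few regular expressions pull out the fields. The field names, patterns, and sample text are hypothetical, and real invoices vary enough that patterns like these would need tuning per layout.

```typescript
interface InvoiceFields {
  invoiceNumber: string | null;
  total: number | null;
}

// Assumes `text` is plain text already produced by a PDF parser or OCR step.
function extractInvoiceFields(text: string): InvoiceFields {
  // Matches e.g. "Invoice No: INV-2024-0042" or "Invoice # 123".
  const numberMatch = text.match(/invoice\s*(?:no\.?|number|#)\s*:?\s*([A-Z0-9-]+)/i);
  // \b keeps "Subtotal" from matching; thousands separators are allowed.
  const totalMatch = text.match(
    /\btotal\s*(?:due)?\s*:?\s*[$€£]?\s*([0-9][0-9,]*(?:\.[0-9]{2})?)/i
  );
  return {
    invoiceNumber: numberMatch ? numberMatch[1] : null,
    total: totalMatch ? Number(totalMatch[1].replace(/,/g, "")) : null,
  };
}

// Made-up sample text standing in for parser output.
const sampleText = [
  "ACME Corp",
  "Invoice No: INV-2024-0042",
  "Subtotal: $1,100.00",
  "Total due: $1,234.50",
].join("\n");

console.log(extractInvoiceFields(sampleText));
```

This runs locally with no data leaving the machine, but the patterns are brittle; that trade-off between determinism and flexibility is the one being argued here.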

While there are plenty of tools for structured data extraction, I think there’s still room for a streamlined, all-in-one solution. This gap likely explains the abundance of closed-source commercial options tackling this very challenge.

---

1: https://platform.openai.com/docs/models#how-we-use-your-data

2: https://github.com/axa-group/Parsr

replies(5): >>42175186 #>>42176460 #>>42176836 #>>42178185 #>>42195512 #

themanmaran ◴[18 Nov 24 23:10 UTC] No.42178185[source]▶

>>42173837 #

Disappointed to see this is an exact rip of our open source tool zerox [1]. With no attribution. They also took the MIT License and changed it out for an AGPL.

If you inspect the source code, it's a verbatim copy. They literally just renamed the ZeroxOutput to DocumindOutput [2][3]

[1] https://github.com/getomni-ai/zerox

[2] https://github.com/DocumindHQ/documind/blob/main/core/src/ty...

[3] https://github.com/getomni-ai/zerox/blob/main/node-zerox/src...

replies(3): >>42178533 #>>42178736 #>>42200734 #

Tammilore ◴[19 Nov 24 00:12 UTC] No.42178736[source]▶

>>42178185 #

Hello. I apologize that it came across this way. This was not the intention. Zerox was definitely used and I made sure to copy and include the MIT license exactly as it was inside the part of the code that uses Zerox.

If there's anything else I can do, please let me know so I can make all amendments immediately.

replies(1): >>42200920 #

1. gmerc ◴[21 Nov 24 03:50 UTC] No.42200920[source]▶

>>42178736 #

You took their code, did a search and replace on the product name, and relicensed the code as AGPL?

You're going to have to delete this thing and start over man.

replies(1): >>42203869 #

2. leojaygod ◴[21 Nov 24 12:55 UTC] No.42203869[source]▶

>>42200920 (TP) #

It appears that the MIT license was correctly included to apply to the zerox code used while the AGPL license applies to their own code. Isn’t this how it should be?
