
169 points | Tammilore | 7 comments

Documind is an open-source tool that turns documents into structured data using AI.

What it does:

- Extracts specific data from PDFs based on your custom schema
- Returns clean, structured JSON that's ready to use
- Works with just a PDF link + your schema definition

Just run npm install documind to get started.
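
To make the workflow concrete, here's a hypothetical sketch of the "PDF link + schema" call in TypeScript. The extract function name and the schema shape are assumptions based on the description above, not a verified copy of Documind's API - check the project README for the real signatures.

    // Hypothetical usage sketch -- names below are assumptions, not
    // Documind's confirmed API.
    import { extract } from "documind";

    const schema = [
      { name: "invoice_number", type: "string", description: "Invoice ID" },
      { name: "total_amount", type: "number", description: "Grand total due" },
      { name: "due_date", type: "string", description: "Due date (ISO 8601)" },
    ];

    const result = await extract({
      file: "https://example.com/sample-invoice.pdf", // a link to the PDF
      schema,                                         // your custom schema
    });

    console.log(result); // clean, structured JSON matching the schema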

emmanueloga_ No.42173837
From the source, Documind appears to:

1) Install tools like Ghostscript, GraphicsMagick, and LibreOffice with a JS script.
2) Convert document pages to Base64 PNGs and send them to OpenAI for data extraction (sketched below).
3) Use Supabase for unclear reasons.
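
Step 2 is the usual vision-model pattern. A minimal sketch with the official openai Node SDK - this illustrates the general pattern, not Documind's actual code, and assumes the page was already rendered to page-1.png by Ghostscript or similar:

    // Minimal sketch of step 2: send one rendered page to the chat
    // completions endpoint as a base64 data URL.
    import fs from "node:fs";
    import OpenAI from "openai";

    const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

    const pageBase64 = fs.readFileSync("page-1.png").toString("base64");

    const response = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "Extract the invoice number and total as JSON." },
            { type: "image_url", image_url: { url: `data:image/png;base64,${pageBase64}` } },
          ],
        },
      ],
    });

    console.log(response.choices[0].message.content);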

Some issues with this approach:

* OpenAI may retain and use your data for training, raising privacy concerns [1].

* Dependencies should be managed with Docker or package managers like Nix or Pixi, which are more robust. For example, a tool like Parsr [2] provides a Dockerized PDF-to-JSON solution, complete with OCR support and an HTTP API.

* GPT-4 Vision seems like a costly and error-prone solution; without human review, it's not really suited for extracting data from sensitive docs like invoices.

* Traditional methods (PDF parsers with OCR support) are cheaper, more reliable, and avoid retention risks for this particular use case. These tools do require some plumbing, though... LLMs could probably help with that! (A rough sketch of the local route is below.)
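
As a sketch of that local route (not a drop-in replacement for Documind): the pdf-parse npm package pulls the raw text layer without sending anything to a third party; scanned docs would still need an OCR pass (e.g. Tesseract), and the schema-mapping plumbing sits on top of this:

    // Sketch of the local, "traditional" route using the pdf-parse package.
    // No data leaves the machine; scanned PDFs would need OCR first.
    import fs from "node:fs";
    import pdf from "pdf-parse";

    const buffer = fs.readFileSync("invoice.pdf");
    const data = await pdf(buffer);

    console.log(data.numpages); // page count
    console.log(data.text);     // raw text layer; mapping this onto a schema
                                // (regexes, or an LLM) is the remaining plumbing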

While there are plenty of tools for structured data extraction, I think there’s still room for a streamlined, all-in-one solution. This gap likely explains the abundance of closed-source commercial options tackling this very challenge.

---

1: https://platform.openai.com/docs/models#how-we-use-your-data

2: https://github.com/axa-group/Parsr

replies(5): >>42175186 >>42176460 >>42176836 >>42178185 >>42195512
1. groby_b No.42176460
That's not what [1] says, though? Quoth: "As of March 1, 2023, data sent to the OpenAI API will not be used to train or improve OpenAI models (unless you explicitly opt-in to share data with us, such as by providing feedback in the Playground)."

"Traditional methods (PDF parsers with OCR support) are cheaper, more reliable"

Not sure about the reliability - the ones I'm using all fail at structured data. If you want a table extracted from a PDF, LLMs are your friend. (Recommendations welcome.)
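
For what it's worth, a rough sketch of that pattern with OpenAI's Structured Outputs - the schema and table text here are made up, and the table is assumed to have been pulled out of the PDF (or OCR'd) already:

    // Sketch: constrain the model with a JSON schema so the reply is
    // machine-readable instead of free-form prose.
    import OpenAI from "openai";

    const client = new OpenAI();

    const tableText = "Item  Qty  Price\nWidget  2  9.99\nGadget  1  24.50"; // stand-in

    const response = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        { role: "user", content: `Extract the rows from this table:\n${tableText}` },
      ],
      response_format: {
        type: "json_schema",
        json_schema: {
          name: "table_rows",
          strict: true,
          schema: {
            type: "object",
            properties: {
              rows: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    item: { type: "string" },
                    qty: { type: "number" },
                    price: { type: "number" },
                  },
                  required: ["item", "qty", "price"],
                  additionalProperties: false,
                },
              },
            },
            required: ["rows"],
            additionalProperties: false,
          },
        },
      },
    });

    console.log(JSON.parse(response.choices[0].message.content ?? "{}"));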

replies(2): >>42176810 #>>42179086 #
2. niklasd No.42176810
We found that OpenAI's LLMs aren't great for extracting tables. What's working well for us is Docling (https://github.com/DS4SD/docling/).
replies(2): >>42178239 #>>42180258 #
3. soci No.42178239
Agreed, extracting tables from PDFs using any of the available OpenAI models has been a waste of prompting time here too.
4. emmanueloga_ No.42179086
> That's not what [1] says, though?

Documind is using https://api.openai.com/v1/chat/completions; see the note at the end of the long API table in the docs [1]:

> * Chat Completions:

> Image inputs via the gpt-4o, gpt-4o-mini, chatgpt-4o-latest, or gpt-4-turbo models (or previously gpt-4-vision-preview) are not eligible for zero retention.

--

1: https://platform.openai.com/docs/models#how-we-use-your-data

replies(1): >>42188577 #
5. emmanueloga_ No.42180258
Haven't seen Docling before - it looks great! Thanks for sharing.
6. groby_b No.42188577
Thanks for pointing there!

It's still not used for training, though, and the retention period is 30 days. It's... a livable compromise for some (many) use cases.

I kind of get the abuse-policy reason for image inputs, and it makes sense that multi-turn conversations require 1h of audio retention, too. I'm just incredibly puzzled why schemas for Structured Outputs aren't eligible for zero retention.

replies(1): >>42190973 #
7. emmanueloga_ No.42190973
Gotcha - from what I could find online, I think you are right. I was conflating data that isn't covered by the zero-retention policy with data used for training.