Show HN: Documind – Open-source AI tool to turn documents into structured data

(github.com)

Documind is an open-source tool that turns documents into structured data using AI.

What it does:

- Extracts specific data from PDFs based on your custom schema

- Returns clean, structured JSON that's ready to use

- Works with just a PDF link + your schema definition

Just run npm install documind to get started.

1. bob778 ◴[18 Nov 24 12:31 UTC] No.42171806[source]▶

>>42171311 (OP) #

From just reading the README, the example is not valid JSON. Is that intentional?

Otherwise it seems like a prompt building tool, or am I missing something here?

replies(2): >>42171913 #>>42172158 #

2. rkuodys ◴[18 Nov 24 12:38 UTC] No.42171839[source]▶

>>42171311 (OP) #

Just this weekend I was solving a similar problem.

What I noticed is that on scanned documents, where stamp text and handwriting are just as important as printed text, Gemini was way better than ChatGPT.

Of course, my prompts might have been the issue, but Gemini produced significantly better results with very brief and generic queries.

3. inexcf ◴[18 Nov 24 12:51 UTC] No.42171899[source]▶

>>42171311 (OP) #

Got excited about an open-source tool doing this.

Alas, I am let down. It is an open-source tool that creates the prompt for the OpenAI API, and I can't go and send customer data to them.

I'm aware of https://github.com/clovaai/donut so i hoped this would be more like that.

replies(5): >>42171944 #>>42171963 #>>42172184 #>>42172234 #>>42195901 #

4. assanineass ◴[18 Nov 24 12:53 UTC] No.42171913[source]▶

>>42171806 #

Oof you’re right LOL

5. _joel ◴[18 Nov 24 12:59 UTC] No.42171944[source]▶

>>42171899 #

You can self-host OpenAI-compatible models with LM Studio and the like. I've used it with https://anythingllm.com/

6. danbruc ◴[18 Nov 24 13:02 UTC] No.42171959[source]▶

>>42171311 (OP) #

With such a system, how do you ensure that the extracted data matches the data in the source document? Run the process several times and check that the results are identical? Can it reject inputs for manual processing? Or is it intended to be always checked manually? How good is it, how many errors does it make, say per million extracted values?

replies(1): >>42172472 #
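The "run the process several times and check that the results are identical" idea from the comment above can be sketched roughly as follows. This is only an illustration: the `Extraction` shape and the `checkConsensus` helper are hypothetical names, not part of Documind, and agreement between runs does not prove the value matches the source document, since the same misreading can repeat.

```typescript
// Hypothetical shape: one extraction run yields a flat record of field -> value.
type Extraction = Record<string, string | number | null>;

interface ConsensusResult {
  accepted: boolean;        // true only if every run agreed on every field
  disagreements: string[];  // sorted names of fields whose values differed
}

// Compare several runs over the same document. A field that is missing in a
// run is treated as null, so "missing" versus "present" counts as a mismatch.
function checkConsensus(runs: Extraction[]): ConsensusResult {
  if (runs.length < 2) {
    throw new Error("need at least two runs to compare");
  }
  // Collect every field name seen in any run.
  const fields: string[] = [];
  for (const run of runs) {
    for (const key of Object.keys(run)) {
      if (fields.indexOf(key) === -1) fields.push(key);
    }
  }
  fields.sort();
  // A field disagrees if any run serializes it differently from the first run.
  const disagreements = fields.filter((field) => {
    const serialized = runs.map((run) =>
      JSON.stringify(field in run ? run[field] : null),
    );
    return serialized.some((value) => value !== serialized[0]);
  });
  return { accepted: disagreements.length === 0, disagreements };
}
```

Documents with a non-empty `disagreements` list could be routed to manual processing instead of being accepted automatically.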

7. turblety ◴[18 Nov 24 13:03 UTC] No.42171963[source]▶

>>42171899 #

You might be able to use Ollama, which has an OpenAI-compatible API.

replies(1): >>42172054 #

8. khaki54 ◴[18 Nov 24 13:13 UTC] No.42172027[source]▶

>>42171311 (OP) #

Not sure I would want something non-deterministic in my data pipeline. Maybe if it used GenAI to _develop a ruleset_ that could then be deployed, it would be more practical.

9. Zambyte ◴[18 Nov 24 13:17 UTC] No.42172054{3}[source]▶

>>42171963 #

Not without changing the code (should be easy though)

https://github.com/DocumindHQ/documind/blob/d91121739df03867...
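To illustrate why the change discussed above should be small: Ollama serves an OpenAI-compatible API under `/v1` on its default local port, so in principle only the base URL and model name differ. The sketch below does not reflect Documind's actual code; `buildChatRequest` is a hypothetical helper, and the model name `llama3.2-vision` and the placeholder key are assumptions.

```typescript
interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

const OPENAI_BASE_URL = "https://api.openai.com/v1";
const OLLAMA_BASE_URL = "http://localhost:11434/v1"; // Ollama's default local address

// Build (but do not send) a Chat Completions request for a given base URL.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  prompt: string,
): ChatRequest {
  return {
    // Strip trailing slashes so both "…/v1" and "…/v1/" work.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// Same request shape, different target: a local model instead of OpenAI.
// The key is a placeholder (assumption: a local server does not check it).
const localRequest = buildChatRequest(
  OLLAMA_BASE_URL,
  "placeholder-key",
  "llama3.2-vision",
  "Convert the following PDF page to markdown.",
);
```

Sending it would then be a matter of `fetch(localRequest.url, { method: "POST", headers: localRequest.headers, body: localRequest.body })`, with a hardcoded endpoint in the tool replaced by a configurable base URL.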

10. avereveard ◴[18 Nov 24 13:23 UTC] No.42172088[source]▶

>>42171311 (OP) #

> an interesting open source project

enthusiastically setting up a lounge chair

> OPENAI_API_KEY=your_openai_api_key

carrying it back apathetically

replies(2): >>42172205 #>>42172807 #

11. gibsonf1 ◴[18 Nov 24 13:25 UTC] No.42172102[source]▶

>>42171311 (OP) #

I'm not sure that having statistics with fabrication try to extract text from PDFs would result in any reliable, mission-critical data.

12. Tammilore ◴[18 Nov 24 13:35 UTC] No.42172158[source]▶

>>42171806 #

Thanks for pointing this out. This was an error on my part.

I see someone opened an issue for it so will fix now.

13. Tammilore ◴[18 Nov 24 13:39 UTC] No.42172184[source]▶

>>42171899 #

Hi. I totally get the concern about sending data to OpenAI. Right now, Documind uses OpenAI's API just so people could quickly get started and see what it is like, but I’m open to adding options and contributions that would be better for privacy.

replies(1): >>42188570 #

14. Tammilore ◴[18 Nov 24 13:41 UTC] No.42172205[source]▶

>>42172088 #

Thanks for the laugh and your feedback! I know that depending on OpenAI isn't ideal for everyone. I'm considering ways to make it more self-contained in the future, so it’s great to hear what users are looking for.

replies(1): >>42175974 #

15. eichi ◴[18 Nov 24 13:43 UTC] No.42172214[source]▶

>>42171311 (OP) #

  const systemPrompt = `
    Convert the following PDF page to markdown.
    Return only the markdown with no explanation text. Do not include deliminators like '''markdown.
    You must include all information on the page. Do not exclude headers, footers, or subtext.
  `;

replies(1): >>42172227 #

16. ◴[18 Nov 24 13:45 UTC] No.42172227[source]▶

>>42172214 #

17. ◴[18 Nov 24 13:46 UTC] No.42172234[source]▶

>>42171899 #

18. thor-rodrigues ◴[18 Nov 24 13:46 UTC] No.42172239[source]▶

>>42171311 (OP) #

Very nice tool! Just last week, I was working on extracting information from PDFs for an automation flow I’m building. I used Unstructured (https://unstructured.io/), which supports multiple file types, not just PDFs.

However, my main issue is that I need to work with confidential client data that cannot be uploaded to a third party. Setting up the open-source, locally hosted version of Unstructured was quite cumbersome due to the numerous additional packages and installation steps required.

While I’m open to the idea of parsing content with an LLM that has vision capabilities, data safety and confidentiality are critical for many applications. I think your project would go from good to great if it were possible to connect to Ollama and run locally.

That said, this is an excellent application! I can definitely see myself using it in other projects that don’t demand such stringent data confidentiality.

replies(1): >>42172303 #

19. Tammilore ◴[18 Nov 24 13:56 UTC] No.42172303[source]▶

>>42172239 #

Thank you, I appreciate the feedback! I understand people wanting data confidentiality and I'm considering connecting Ollama for future updates!

20. glorpsicle ◴[18 Nov 24 14:18 UTC] No.42172472[source]▶

>>42171959 #

Perhaps there's still value in the documents being transformed by this tool and someone reviewing them manually, but obviously the real value would be in reducing manual review. I don't think there's a world–for now–in which this manual review can be completely eliminated.

However, if you process, say, 1 million documents, you could sample and review a small percentage of them manually (a power calculation would help here). Assuming your random sample models the "distribution" (which may be tough to define/summarize) of the 1 million documents, you could then extrapolate your accuracy onto the larger set of documents without having to review each and every one.

replies(1): >>42174980 #
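As a rough illustration of the sampling idea in the comment above, here is a sketch that sizes a random review sample. Note that this is a margin-of-error calculation (the standard normal-approximation formula with a finite population correction), which is simpler than a full power calculation; the function name and defaults are assumptions chosen for the example.

```typescript
// Number of documents to review so that the measured error rate is within
// +/- marginOfError of the true rate, at the confidence implied by zScore.
function sampleSize(
  populationSize: number,
  marginOfError: number,
  zScore: number = 1.96,            // roughly 95% confidence
  expectedErrorRate: number = 0.5,  // 0.5 is the most conservative guess
): number {
  if (populationSize <= 0 || marginOfError <= 0 || marginOfError >= 1) {
    throw new RangeError("populationSize must be > 0 and 0 < marginOfError < 1");
  }
  const p = expectedErrorRate;
  // Sample size for an effectively infinite population.
  const unadjusted =
    (zScore * zScore * p * (1 - p)) / (marginOfError * marginOfError);
  // Finite population correction: smaller populations need fewer samples.
  const adjusted = unadjusted / (1 + (unadjusted - 1) / populationSize);
  return Math.ceil(adjusted);
}

const toReview = sampleSize(1000000, 0.01);
```

For 1 million documents and a margin of one percentage point, this comes to a little over 9,500 documents to review, under the assumption that the sample is drawn uniformly at random.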

21. ◴[18 Nov 24 14:53 UTC] No.42172807[source]▶

>>42172088 #

22. asjfkdlf ◴[18 Nov 24 15:41 UTC] No.42173400[source]▶

>>42171311 (OP) #

I am looking for a similar service that turns any document (PNG, PDF, DOCX) into JSON (preserving the field relationships). I tried with ChatGPT, but hallucinations are common. Does anything exist?

replies(2): >>42173587 #>>42173893 #

23. omk ◴[18 Nov 24 15:56 UTC] No.42173587[source]▶

>>42173400 #

This is also using OpenAI's GPT model. So the same hallucinations are probable here for PDFs.

24. hirezeeshan ◴[18 Nov 24 16:02 UTC] No.42173661[source]▶

>>42171311 (OP) #

That's a valid problem you are solving. I had a similar use case that I solved using PDF[dot]co

25. azinman2 ◴[18 Nov 24 16:14 UTC] No.42173795[source]▶

>>42171311 (OP) #

Looking at the source it seems this is just a thin wrapper over OpenAI. Am I missing something?

26. emmanueloga_ ◴[18 Nov 24 16:16 UTC] No.42173837[source]▶

>>42171311 (OP) #

From the source, Documind appears to:

1) Install tools like Ghostscript, GraphicsMagick, and LibreOffice with a JS script.

2) Convert document pages to Base64 PNGs and send them to OpenAI for data extraction.

3) Use Supabase for unclear reasons.

Some issues with this approach:

* OpenAI may retain and use your data for training, raising privacy concerns [1].

* Dependencies should be managed with Docker or package managers like Nix or Pixi, which are more robust. Example: a tool like Parsr [2] provides a Dockerized pdf-to-json solution, complete with OCR support and an HTTP api.

* GPT-4 vision seems like a costly, error-prone, and unreliable solution, not really suited for extracting data from sensitive docs like invoices, without review.

* Traditional methods (PDF parsers with OCR support) are cheaper, more reliable, and avoid retention risks for this particular use case. Although these tools do require some plumbing... probably LLMs can really help with that!

While there are plenty of tools for structured data extraction, I think there’s still room for a streamlined, all-in-one solution. This gap likely explains the abundance of closed-source commercial options tackling this very challenge.

---

1: https://platform.openai.com/docs/models#how-we-use-your-data

2: https://github.com/axa-group/Parsr

replies(5): >>42175186 #>>42176460 #>>42176836 #>>42178185 #>>42195512 #
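Step 2 in the comment above (pages converted to Base64 PNGs and sent to OpenAI) can be sketched like this. The message layout follows the Chat Completions format for image inputs, where an image is passed as a data URL; the helper name `buildPageRequest` and the types are hypothetical, and this is not a copy of the project's code.

```typescript
// One element of a multi-part user message.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface VisionRequestBody {
  model: string;
  messages: { role: "system" | "user"; content: string | ContentPart[] }[];
}

// Wrap one already-encoded page image in a Chat Completions request body.
// pngBase64 is the Base64 text of the PNG, without any "data:" prefix.
function buildPageRequest(
  pngBase64: string,
  systemPrompt: string,
  model: string,
): VisionRequestBody {
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${pngBase64}` },
          },
        ],
      },
    ],
  };
}

const pageRequest = buildPageRequest(
  "AAAA", // placeholder bytes, not a real image
  "Convert the following PDF page to markdown.",
  "gpt-4o-mini",
);
```

The privacy concern raised in the thread follows directly from this shape: the full page image leaves the machine in every request.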

27. cccybernetic ◴[18 Nov 24 16:21 UTC] No.42173893[source]▶

>>42173400 #

I built a drag-and-drop document converter that extracts text into custom columns (for CSV) or keys (for JSON). You can schedule it to run at certain times and update a database as well.

I haven't had issues with hallucinations. If you're interested, my email is in my bio.

28. infecto ◴[18 Nov 24 17:22 UTC] No.42174595[source]▶

>>42171311 (OP) #

Multimodal LLMs are not the way to do this for a business workflow yet.

In my experience you're much better off starting with Azure Doc Intelligence or AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at them. From there you can use an LLM to interrogate and structure the data to your heart's delight.

replies(2): >>42176035 #>>42176122 #

29. constantinum ◴[18 Nov 24 17:38 UTC] No.42174779[source]▶

>>42171311 (OP) #

Reading from the comments, some of the common questions regarding document extraction are:

* Run locally or on premise for security/privacy reasons

* Support multiple LLMs and vector DBs - plug and play

* Support customisable schemas

* Method to check/confirm accuracy with source

* Cron jobs for automation

There is Unstract that solves the above requirements.

https://github.com/Zipstack/unstract

30. danbruc ◴[18 Nov 24 17:58 UTC] No.42174980{3}[source]▶

>>42172472 #

You can sample the result to determine the error rate, but if you find an unacceptable level of errors, then you still have to review everything manually. On the other hand, if you use traditional techniques, pattern matching with regular expressions and things like that, then you can probably get pretty close to perfection for those cases where your patterns match and you can just reject the rest for manual processing. Maybe you could ask a language model to compare the source document and the extracted data and to indicate whether there are errors, but I am not sure if that would help, maybe what tripped up the extraction would also trip up the result evaluation.
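A small sketch of the "traditional" approach described above: strict patterns that either match completely or hand the document to a human. The invoice layout, the regular expressions, and the names `extractInvoice` and `RuleResult` are all made up for illustration; real documents would need patterns written against their actual formats.

```typescript
interface InvoiceFields {
  invoiceNumber: string;
  total: number;
}

type RuleResult =
  | { status: "extracted"; fields: InvoiceFields }
  | { status: "manual_review"; reason: string };

// Deliberately strict patterns: they match one known layout and nothing else.
const INVOICE_NUMBER = /\bInvoice\s+(?:No\.?|Number)\s*:\s*([A-Z]{2,5}-\d{4}-\d{3,})\b/;
const TOTAL = /\bTotal\s*:\s*\$?(\d{1,3}(?:,\d{3})*\.\d{2})\b/;

// Extract fields only when every pattern matches; otherwise reject the input
// for manual processing instead of guessing.
function extractInvoice(text: string): RuleResult {
  const invoiceNumber = INVOICE_NUMBER.exec(text)?.[1];
  const totalText = TOTAL.exec(text)?.[1];
  if (invoiceNumber === undefined || totalText === undefined) {
    return {
      status: "manual_review",
      reason: "one or more required patterns did not match",
    };
  }
  return {
    status: "extracted",
    fields: {
      invoiceNumber,
      total: Number(totalText.replace(/,/g, "")), // "1,234.50" -> 1234.5
    },
  };
}
```

The trade-off is the one the comment describes: high confidence in whatever matches, at the cost of a pile of rejected documents that still need a person.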

31. ◴[18 Nov 24 18:15 UTC] No.42175186[source]▶

>>42173837 #

32. vr46 ◴[18 Nov 24 19:20 UTC] No.42175881[source]▶

>>42171311 (OP) #

I’ll have to test this against my local Python pipeline which does all this without an LLM in attendance. There are a ton of existing Python libraries which have been doing this for a long time, so let’s take a look..

replies(1): >>42176786 #

33. avereveard ◴[18 Nov 24 19:29 UTC] No.42175974{3}[source]▶

>>42172205 #

litellm would be a start: you just pass in a model string that includes the provider, and it can default to OpenAI's GPT models. That removes most of the adaptation effort for both you and other users.
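litellm itself is a Python library, so the following is only a TypeScript sketch of the convention the comment describes: a single `provider/model` string, with a bare model name falling back to OpenAI. `parseModelString` and `ModelTarget` are hypothetical names, not litellm APIs.

```typescript
interface ModelTarget {
  provider: string;
  model: string;
}

// Split "provider/model" on the first slash only, because model names can
// themselves contain slashes. A bare name falls back to defaultProvider.
function parseModelString(
  value: string,
  defaultProvider: string = "openai",
): ModelTarget {
  const trimmed = value.trim();
  const slash = trimmed.indexOf("/");
  // Reject empty strings and strings with an empty provider or model part.
  if (trimmed.length === 0 || slash === 0 || slash === trimmed.length - 1) {
    throw new Error(`invalid model string: "${value}"`);
  }
  if (slash === -1) {
    return { provider: defaultProvider, model: trimmed };
  }
  return {
    provider: trimmed.slice(0, slash),
    model: trimmed.slice(slash + 1),
  };
}
```

With something like this, a user could switch from `gpt-4o` to `ollama/llama3.2-vision` through configuration alone, which is the adaptation saving the comment points at.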

34. IndieCoder ◴[18 Nov 24 19:34 UTC] No.42176035[source]▶

>>42174595 #

Plus one, using the exact setup to make it scale. If Azure Doc Intelligence gets too expensive, VLMs also work great

replies(1): >>42177063 #

35. disgruntledphd2 ◴[18 Nov 24 19:44 UTC] No.42176122[source]▶

>>42174595 #

> AWS Textract to first get the structure of the document (PDF). These tools are incredibly robust and do a great job with most of the common cases you can throw at it.

Do they work for Bills of Lading yet? When I tested a sample of these bills a few years back (2022 I think), the results were not good at all. But I honestly wouldn't be surprised if they'd massively improved lately.

replies(1): >>42178301 #

36. groby_b ◴[18 Nov 24 20:18 UTC] No.42176460[source]▶

>>42173837 #

That's not what [1] says, though? Quoth: "As of March 1, 2023, data sent to the OpenAI API will not be used to train or improve OpenAI models (unless you explicitly opt-in to share data with us, such as by providing feedback in the Playground). "

"Traditional methods (PDF parsers with OCR support) are cheaper, more reliable"

Not sure on the reliability - the ones I'm using all fail at structured data. You want a table extracted from a PDF, LLMs are your friend. (Recommendations welcome)

replies(2): >>42176810 #>>42179086 #

37. thegabriele ◴[18 Nov 24 20:44 UTC] No.42176786[source]▶

>>42175881 #

Care to share the best ones for some use cases? Thanks

replies(1): >>42177301 #

38. niklasd ◴[18 Nov 24 20:46 UTC] No.42176810{3}[source]▶

>>42176460 #

We found that for extracting tables, OpenAI's LLMs aren't great. What is working well for us is Docling (https://github.com/DS4SD/docling/)

replies(2): >>42178239 #>>42180258 #

39. brianjking ◴[18 Nov 24 20:49 UTC] No.42176836[source]▶

>>42173837 #

OpenAI isn't retaining your data sent via the API for training. Stop.

40. vinothgopi ◴[18 Nov 24 21:08 UTC] No.42177063{3}[source]▶

>>42176035 #

What is a VLM?

replies(1): >>42177860 #

41. vr46 ◴[18 Nov 24 21:31 UTC] No.42177301{3}[source]▶

>>42176786 #

MinerU

PDFQuery

PyMuPDF (having more success with older versions, right now)

42. saharhash ◴[18 Nov 24 22:37 UTC] No.42177860{4}[source]▶

>>42177063 #

A Vision Language Model, like Qwen VL https://github.com/QwenLM/Qwen2-VL or ColPali https://huggingface.co/blog/manu/colpali

replies(1): >>42195886 #

43. themanmaran ◴[18 Nov 24 23:10 UTC] No.42178185[source]▶

>>42173837 #

Disappointed to see this is an exact rip of our open source tool zerox [1]. With no attribution. They also took the MIT License and changed it out for an AGPL.

If you inspect the source code, it's a verbatim copy. They literally just renamed the ZeroxOutput to DocumindOutput [2][3]

[1] https://github.com/getomni-ai/zerox

[2] https://github.com/DocumindHQ/documind/blob/main/core/src/ty...

[3] https://github.com/getomni-ai/zerox/blob/main/node-zerox/src...

replies(3): >>42178533 #>>42178736 #>>42200734 #

44. soci ◴[18 Nov 24 23:16 UTC] No.42178239{4}[source]▶

>>42176810 #

Agreed, extracting tables in PDFs using any of the available OpenAI models has been a waste of prompting time here too.

45. infecto ◴[18 Nov 24 23:22 UTC] No.42178301{3}[source]▶

>>42176122 #

I have not used it on your docs, but I can say that it definitely works well with forms and forms with tables, like a Bill of Lading. It costs extra, but you need to turn on table extraction (at least in AWS). You can then get a markdown representation of the page, including the table. You can of course pull out the table itself, but unless it's standardized you will need the middleman LLM to figure out the exact data/structure you are looking for.

replies(1): >>42193210 #

46. vunderba ◴[18 Nov 24 23:44 UTC] No.42178522[source]▶

>>42171311 (OP) #

OP, you've been accused of literally ripping off somebody's more popular repository and posing it as your own.

https://news.ycombinator.com/item?id=42178413

You may wanna get ahead of this because the evidence is fairly damning. Failing to even give credit to the original project is a pretty gross move.

replies(1): >>42178774 #

47. alchemist1e9 ◴[18 Nov 24 23:45 UTC] No.42178533{3}[source]▶

>>42178185 #

Are there any reputation mechanisms or github flagging systems to alert users to such scams?

It’s a pretty unethical behavior if what you describe is the full story and as a user of many open source projects how can one be aware of this type of behavior?

48. Tammilore ◴[19 Nov 24 00:12 UTC] No.42178736{3}[source]▶

>>42178185 #

Hello. I apologize that it came across this way. This was not the intention. Zerox was definitely used and I made sure to copy and include the MIT license exactly as it was inside the part of the code that uses Zerox.

If there's anything additional I can do, please let me know so I can make all amendments immediately.

replies(1): >>42200920 #

49. Tammilore ◴[19 Nov 24 00:16 UTC] No.42178774[source]▶

>>42178522 #

Hi. This was definitely not the intention.

I made sure to copy and paste the MIT license in Zerox exactly as it was into the folder of the code that uses it. I also included it in the main license file. If there's anything I can do to make corrections, please let me know so I can change that ASAP.

replies(1): >>42210625 #

50. emmanueloga_ ◴[19 Nov 24 00:59 UTC] No.42179086{3}[source]▶

>>42176460 #

> That's not what [1] says, though?

Documind is using https://api.openai.com/v1/chat/completions, check the docs at the end of the long API table [1]:

> * Chat Completions:

> Image inputs via the gpt-4o, gpt-4o-mini, chatgpt-4o-latest, or gpt-4-turbo models (or previously gpt-4-vision-preview) are not eligible for zero retention.

1: https://platform.openai.com/docs/models#how-we-use-your-data

replies(1): >>42188577 #

51. emmanueloga_ ◴[19 Nov 24 05:12 UTC] No.42180258{4}[source]▶

>>42176810 #

Haven't seen Docling before, it looks great! Thanks for sharing.

52. ◴[19 Nov 24 18:04 UTC] No.42186391[source]▶

>>42171311 (OP) #

53. inexcf ◴[19 Nov 24 22:01 UTC] No.42188570{3}[source]▶

>>42172184 #

That sounds great.

54. groby_b ◴[19 Nov 24 22:01 UTC] No.42188577{4}[source]▶

>>42179086 #

Thanks for pointing there!

It's still not used for training, though, and the retention period is 30 days. It's... a livable compromise for some(many) use cases.

I kind of get the abuse policy reason for image inputs. It makes sense for multi-turn conversations to require a 1h audio retention, too. I'm just incredibly puzzled why schemas for structured outputs aren't eligible for zero-retention.

replies(2): >>42190973 #>>42225831 #

55. emmanueloga_ ◴[20 Nov 24 05:05 UTC] No.42190973{5}[source]▶

>>42188577 #

Gotcha, from what I could find online I think you are right. I was conflating data not under zero-retention-policy with data-for-training.

56. disgruntledphd2 ◴[20 Nov 24 12:15 UTC] No.42193210{4}[source]▶

>>42178301 #

Huh, interesting. I'll have to try again next time I need to parse stuff like this.

57. sidmo ◴[20 Nov 24 16:27 UTC] No.42195512[source]▶

>>42173837 #

If you are looking for the latest/greatest in file processing, I'd recommend checking out vision language models. They generate embeddings of the images themselves (as a collection of patches) and you can see query matching displayed as a heatmap over the document. Picks up text that OCR misses. My company DataFog has an open-source demo if you want to try it out: https://github.com/DataFog/vlm-api

If you're looking for an all-in-one solution, little plug for our new platform that does the above and also allows you to create custom 'patterns' that get picked up via semantic search. Uses open-source models by default, can deploy into your internal network. www.datafog.ai. In beta now and onboarding manually. Shoot me an email if you'd like to learn more!

58. sidmo ◴[20 Nov 24 16:59 UTC] No.42195886{5}[source]▶

>>42177860 #

VLMs are cool - they generate embeddings of the images themselves (as a collection of patches) and you can see query matching displayed as a heatmap over the document. Picks up text that OCR misses. Here's an open-source API demo I built if you want to try it out: https://github.com/DataFog/vlm-api

59. sidmo ◴[20 Nov 24 17:00 UTC] No.42195901[source]▶

>>42171899 #

I'd recommend checking out vision language models. They generate embeddings of the images themselves (as a collection of patches) and you can see query matching displayed as a heatmap over the document. Picks up text that OCR misses. I built a simple API over it if you want to try it out: https://github.com/DataFog/vlm-api

60. dontdoxxme ◴[21 Nov 24 03:12 UTC] No.42200734{3}[source]▶

>>42178185 #

For the MIT license to make sense it needs a copyright notice, I don’t actually see one in the original license. It just says “The MIT license” but then the text below references the above copyright notice, which doesn’t exist.

I think both sides here can learn from this, copyright notices are technically not required but when some text references them it is very useful. The original author should have added one. The user of the code could also have asked about the copyright. If this were to go to court having the original license not making sense could create more questions than it should.

tl;dr: add a copyright line at the top of the file when you’re using the MIT license.

61. gmerc ◴[21 Nov 24 03:50 UTC] No.42200920{4}[source]▶

>>42178736 #

You took their code, did a search and replace on the product name, and you've relicensed the code as AGPL?

You're going to have to delete this thing and start over man.

replies(1): >>42203869 #

62. leojaygod ◴[21 Nov 24 12:55 UTC] No.42203869{5}[source]▶

>>42200920 #

It appears that the MIT license was correctly included to apply to the zerox code used while the AGPL license applies to their own code. Isn’t this how it should be?

63. slippy ◴[21 Nov 24 18:12 UTC] No.42206997[source]▶

>>42171311 (OP) #

Legit question: By _removing the MIT license_ from the distribution and replacing it with the AGPL, how are you not violating the copyright and subject to a lawsuit?

The MIT license has just 2 conditions. They are pretty easy to read, and the first one is:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

By replacing the license, you violate this very simple agreement.

replies(1): >>42212394 #

64. ankenyr ◴[22 Nov 24 02:25 UTC] No.42210625{3}[source]▶

>>42178774 #

Your initial commit makes it look like you wrote all the code. https://github.com/DocumindHQ/documind/commit/d91121739df038... This is because you copied and uploaded the code instead of forking. You could do a lot by restoring attribution. Your history would look the same as https://github.com/getomni-ai/zerox/commits/main/ and diverge from where you forked.

People are getting upset because this is not a nice thing to do. Attribution is significant. No one would care if you replaced all the names with the new ones in a fork because they would see commits that do that.

replies(1): >>42212363 #

65. Tammilore ◴[22 Nov 24 09:22 UTC] No.42212363{4}[source]▶

>>42210625 #

Hi. Thank you for pointing this out. I totally understand now that forking would have kept the commit history visible and made the attribution clearer. I have since added a direct note in the repo acknowledging that it is built on the original Zerox project and also linked back to it. If there’s anything else you’d suggest, happy to hear it. Thanks again.

replies(1): >>42216000 #

66. Tammilore ◴[22 Nov 24 09:28 UTC] No.42212394[source]▶

>>42206997 #

Hi. Thanks for the question. To clarify, the MIT license was never removed or swapped. The license was and still is included in the folder that contains the code from the original project. In the root of the repository, I added the AGPL license for the new code I developed and made sure to explicitly acknowledge that the code in the folder is still under the MIT license.

I’ve also added a direct note acknowledging and linking back to the zerox project.

67. ankenyr ◴[22 Nov 24 18:03 UTC] No.42216000{5}[source]▶

>>42212363 #

It would be better to attribute. You can still do this by fixing the git commit history and doing a force push. It would do a lot to make people feel better.

68. pconstantine ◴[24 Nov 24 03:54 UTC] No.42225831{5}[source]▶

>>42188577 #

It takes >50 seconds to generate these schemas for some pretty simple use-cases with large enums, for example. Imagine that latency added to each request...
