
293 points lapnect | 5 comments
1. alecco ◴[] No.42155767[source]
Is it possible to do this locally with open source software? I have a lot of accounting PDFs to convert but due to privacy concerns it should not run in the cloud.
replies(4): >>42155855 #>>42156382 #>>42156587 #>>42156958 #
2. criddell ◴[] No.42155855[source]
Does it have to be open source, or does it just have to run locally? The paid version of Acrobat does this well. macOS has pretty good built-in OCR capabilities and Windows isn’t far behind.

If you have the hardware for it, you can run some LLMs locally, although for accounting data I probably wouldn’t trust them.

3. cess11 ◴[] No.42156382[source]
Either you need to be somewhat tolerant when it comes to misinterpretations and hallucinations, or you'll be proofreading a lot.

A cheap hack is to push the documents through pdftotext from Poppler and, if nothing or very little comes out, push them through OCRmyPDF and run pdftotext on the result. If the documents are scanned you probably want some flags for deskewing and so on.

It's a decent technique for making a bulk load of PDFs mostly greppable, but to get every 0 as a 0 you're probably going to have to proofread every conversion.
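
A rough sketch of that fallback in Python, assuming `pdftotext` and `ocrmypdf` are both installed and on the PATH. The 50-character threshold and the function names are made up for illustration, and `--force-ocr` is just one way to handle files that already carry a thin or junk text layer; tune both to your documents.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical cutoff: fewer non-whitespace characters than this counts
# as "nothing or very little came out".
MIN_CHARS = 50


def looks_empty(text: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True when the extracted text is too short to be useful."""
    return len("".join(text.split())) < min_chars


def pdftotext(pdf: Path, run=subprocess.run) -> str:
    """Extract text with Poppler's pdftotext, writing to stdout ('-')."""
    result = run(
        ["pdftotext", "-layout", str(pdf), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_text(pdf: Path, run=subprocess.run) -> str:
    """Try pdftotext first; fall back to OCRmyPDF if little text comes out."""
    text = pdftotext(pdf, run)
    if not looks_empty(text):
        return text
    with tempfile.TemporaryDirectory() as tmp:
        ocr_pdf = Path(tmp) / "ocr.pdf"
        # --deskew straightens crooked scans; --force-ocr rasterizes and
        # OCRs every page even if a (probably useless) text layer exists.
        run(
            ["ocrmypdf", "--deskew", "--force-ocr", str(pdf), str(ocr_pdf)],
            capture_output=True, text=True, check=True,
        )
        return pdftotext(ocr_pdf, run)
```

Calling `extract_text(Path("statement.pdf"))` on each file and writing the result next to it gets you the greppable pile. It does nothing about misread digits, so the proofreading caveat still applies.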

4. Eisenstein ◴[] No.42156587[source]
I don't recommend using it for anything important unless you very diligently proofread it, but I made one that runs locally that I linked to elsewhere in this post:

* https://news.ycombinator.com/item?id=42155548

5. bugglebeetle ◴[] No.42156958[source]
Yes, Docling and Marker do very similar things and can be run fully locally.