
293 points | lapnect | 1 comment
alecco No.42155767
Is it possible to do this locally with open source software? I have a lot of accounting PDFs to convert but due to privacy concerns it should not run in the cloud.
1. cess11 No.42156382
Either you need to be somewhat tolerant when it comes to misinterpretations and hallucinations, or you'll be proofreading a lot.

A cheap hack is to push the documents through pdftotext from Poppler and, if nothing or very little comes out, push them through OCRmyPDF and then run pdftotext on the result. If they're scanned, you probably want some flags for deskewing and so on.
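A rough sketch of that two-pass approach in Python, shelling out to the two tools. The 50-character threshold, the helper names, and the `.ocr.pdf` output name are assumptions made for illustration, not anything the tools prescribe. `--deskew`, `--rotate-pages` and `--skip-text` are OCRmyPDF options, but which flags you actually want depends on your scans.

```python
import subprocess
from pathlib import Path

MIN_CHARS = 50  # assumed cutoff for "very little came out"; tune per corpus


def needs_ocr(text: str, min_chars: int = MIN_CHARS) -> bool:
    """True when the extracted text has too few non-whitespace characters."""
    return len("".join(text.split())) < min_chars


def pdftotext_cmd(pdf: Path) -> list[str]:
    # "-" as the output file sends the extracted text to stdout.
    return ["pdftotext", "-layout", str(pdf), "-"]


def ocrmypdf_cmd(src: Path, dst: Path) -> list[str]:
    # --deskew straightens crooked scans, --rotate-pages fixes orientation,
    # --skip-text leaves pages that already have a text layer alone.
    return ["ocrmypdf", "--deskew", "--rotate-pages", "--skip-text",
            str(src), str(dst)]


def extract_text(pdf: Path, run=subprocess.run) -> str:
    """Try the existing text layer first; fall back to OCR if it is nearly empty."""
    first = run(pdftotext_cmd(pdf), capture_output=True, text=True, check=True)
    if not needs_ocr(first.stdout):
        return first.stdout
    # Hypothetical naming scheme: invoice.pdf -> invoice.ocr.pdf
    ocr_pdf = pdf.with_name(pdf.stem + ".ocr.pdf")
    run(ocrmypdf_cmd(pdf, ocr_pdf), check=True)
    second = run(pdftotext_cmd(ocr_pdf), capture_output=True, text=True, check=True)
    return second.stdout
```

Calling `extract_text(Path("invoice.pdf"))` needs both `pdftotext` and `ocrmypdf` on the PATH. The `run` parameter exists only so the subprocess call can be swapped out when testing the control flow.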

For making a bulk load of PDFs mostly greppable it's a decent technique, but to get every 0 as a 0 you're probably going to have to proofread every conversion.