357 points ingve | 1 comment
bob1029 No.43974521
When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.

The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.

1. layer8 No.43974634
If you assume standardized documents, you can impose the use of Tagged PDF: https://pdfa.org/resource/tagged-pdf-q-a/
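
For illustration only, here is a rough sketch of how an intake step might check that a submitted file at least claims to be Tagged PDF before any text extraction is attempted. It relies on the fact that a Tagged PDF's document catalog carries a MarkInfo dictionary with Marked set to true. The function name `looks_tagged` is hypothetical, and the raw byte scan is an assumption made to keep the example dependency-free: it will miss catalogs stored in compressed object streams or a MarkInfo given as an indirect reference, so a real check should use a proper PDF parser.

```python
import re

# Heuristic pattern: "/MarkInfo << ... /Marked true" inside a single dictionary.
# [^>]*? keeps the match from running past the end of the MarkInfo dictionary.
_MARKED = re.compile(rb"/MarkInfo\s*<<[^>]*?/Marked\s+true", re.DOTALL)


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain a MarkInfo dictionary with Marked true.

    This is only a cheap pre-filter (hypothetical helper, not a library API).
    It does not validate the structure tree and cannot see compressed objects.
    """
    return _MARKED.search(pdf_bytes) is not None


if __name__ == "__main__":
    import sys

    # Usage: python check_tagged.py document.pdf
    with open(sys.argv[1], "rb") as fh:
        print("claims Tagged PDF" if looks_tagged(fh.read()) else "no tagging marker found")
```

A file that passes this check still needs its structure tree to be validated; the marker only says the producer claims to follow the tagging conventions.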