(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC

bob1029 ◴[13 May 25 16:13 UTC] No.43974521[source]▶

When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.

The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.

1. layer8 ◴[13 May 25 16:23 UTC] No.43974634[source]▶

If you assume standardized documents, you can impose the use of Tagged PDF: https://pdfa.org/resource/tagged-pdf-q-a/
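As a rough illustration of what "imposing Tagged PDF" could look like at intake, here is a minimal sketch that rejects files which do not declare themselves tagged. It assumes the document catalog is stored uncompressed and that /MarkInfo is an inline dictionary; the function name looks_tagged is hypothetical, and the byte-level regex is a heuristic, not a real PDF parser. A catalog held in a compressed object stream, or a /MarkInfo given as an indirect reference, would be missed, so a production check would use a proper PDF library instead.

```python
import re

# Heuristic pattern: a /MarkInfo dictionary written inline that contains
# "/Marked true". This is an assumption about how the file is laid out;
# it does not follow indirect references or decode object streams.
_MARKED_TRUE = re.compile(rb"/MarkInfo\s*<<[^>]*?/Marked\s+true\b")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain an inline /MarkInfo << /Marked true >>.

    A False result only means the marker was not found in plain form;
    it does not prove the document is untagged.
    """
    return _MARKED_TRUE.search(pdf_bytes) is not None


if __name__ == "__main__":
    tagged = b"1 0 obj << /Type /Catalog /MarkInfo << /Marked true >> >> endobj"
    untagged = b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj"
    print(looks_tagged(tagged), looks_tagged(untagged))
```

Even when the marker is present, it only says the producer claims the file is tagged; the quality of the structure tree still depends on the authoring tool.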

PDF to Text, a challenging problem