(www.marginalia.nu)

357 points ingve | 3 comments | 13 May 25 15:01 UTC

1. bob1029 [13 May 25 16:13 UTC] No.43974521

When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.

The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.

replies(2): >>43974604 #>>43974634 #

2. kapitalx [13 May 25 16:21 UTC] No.43974604

>>43974521 (TP) #

This is approximately the approach we're also taking at https://doctly.ai. Add to that a "multiple experts" approach for analyzing the image (for our 'ultra' version), and we get really good results. And we're constantly making it better.

3. layer8 [13 May 25 16:23 UTC] No.43974634

>>43974521 (TP) #

If you assume standardized documents, you can impose the use of Tagged PDF: https://pdfa.org/resource/tagged-pdf-q-a/


PDF to Text, a challenging problem