
357 points ingve | 3 comments
1. bob1029 ◴[] No.43974521[source]
When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.

The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.
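
For illustration, a minimal sketch of what scraping AcroForm values can look like in Python. It assumes the third-party pypdf library and its `PdfReader.get_fields()` call; the helper names `field_values` and `read_acroform` are made up for this example, and reading the value from the `/V` key follows the PDF form-field dictionary layout rather than anything stated in this thread.

```python
from typing import Any, Mapping, Optional


def field_values(fields: Optional[Mapping[str, Any]]) -> dict[str, Any]:
    """Flatten a {field name: field dictionary} mapping to {field name: value}.

    Assumption: each field is dict-like and stores its value under "/V",
    the key a PDF form field dictionary uses for its current value.
    """
    if not fields:
        return {}  # document has no AcroForm, or the form has no fields
    return {name: field.get("/V") for name, field in fields.items()}


def read_acroform(path: str) -> dict[str, Any]:
    """Read all AcroForm field values from the PDF at `path` (hypothetical helper)."""
    # Imported lazily so field_values() stays usable without pypdf installed.
    from pypdf import PdfReader

    return field_values(PdfReader(path).get_fields())
```

Fields that were never filled in come back as `None`, which is usually what you want when validating a standardized form.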

replies(2): >>43974604 #>>43974634 #
2. kapitalx ◴[] No.43974604[source]
This is approximately the approach we're also taking at https://doctly.ai. Add to that a "multiple experts" approach for analyzing the image (for our 'ultra' version), and we get really good results. We're also making it better constantly.
3. layer8 ◴[] No.43974634[source]
If you assume standardized documents, you can impose the use of Tagged PDF: https://pdfa.org/resource/tagged-pdf-q-a/
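
If you do impose Tagged PDF, a cheap first check on intake is whether the document catalog declares itself tagged. A rough sketch, assuming the third-party pypdf library and that `PdfReader(path).trailer["/Root"]` returns the catalog; `looks_tagged` and `is_tagged_pdf` are hypothetical names. It only tests for `/MarkInfo` with `/Marked` true plus a `/StructTreeRoot`, and says nothing about whether the tags are any good.

```python
from typing import Any, Mapping


def looks_tagged(catalog: Mapping[str, Any]) -> bool:
    """Heuristic: catalog has /MarkInfo with /Marked true and a /StructTreeRoot."""
    if "/MarkInfo" not in catalog or "/StructTreeRoot" not in catalog:
        return False
    mark_info = catalog["/MarkInfo"]
    # Compare with == so boolean wrapper objects that equal True are accepted.
    return "/Marked" in mark_info and mark_info["/Marked"] == True  # noqa: E712


def is_tagged_pdf(path: str) -> bool:
    """Apply the heuristic to the PDF at `path` (hypothetical helper)."""
    # Imported lazily so looks_tagged() stays usable without pypdf installed.
    from pypdf import PdfReader

    return looks_tagged(PdfReader(path).trailer["/Root"])
```

Documents that fail this check can be rejected up front instead of being sent through a general-purpose extraction path.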