(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC

bob1029 ◴[13 May 25 16:13 UTC] No.43974521[source]▶

When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.

The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.

replies(2): >>43974604 #>>43974634 #

1. kapitalx ◴[13 May 25 16:21 UTC] No.43974604[source]▶

>>43974521 #

This is approximately the approach we're also taking at https://doctly.ai. Add to that a "multiple experts" approach for analyzing the image (for our 'ultra' version), and we get really good results. We're constantly making it better.

PDF to Text, a challenging problem