PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC


rad_gruchalski ◴[13 May 25 15:31 UTC] No.43974057[source]▶

So many of these problems have been solved by Mozilla's pdf.js together with its viewer implementation: https://mozilla.github.io/pdf.js/

replies(3): >>43974240 #>>43974428 #>>43975184 #

egnehots ◴[13 May 25 16:05 UTC] No.43974428[source]▶

>>43974057 #

I don't think so. pdf.js is able to render PDF content, which is different from extracting "text".

Text in a PDF can be encoded in many ways: in an actual image, in shapes (think segments, quadratic Bézier curves...), or in an XML format (really easy to process).

PDF viewers render text the way a printer would: they process drawing commands whose end result is pixels on the screen.

But paragraphs, text layout, columns, and tables are often lost in the process. You can see them on screen, yet they are not in the data: so close, yet so far. That is why AI is quite strong at this task.
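To make the "lost layout" point concrete, here is a rough sketch in Python. It assumes an extractor has already handed back positioned text runs, which is roughly what you get when text is drawn by positioning commands. The names `TextRun` and `group_into_lines` and both thresholds are made up for illustration; they are not part of pdf.js or any other library, and real documents need far more than this heuristic (columns, tables, rotated text).

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """A hypothetical positioned run of text, as an extractor might report it."""
    x: float      # left edge of the run, in page units
    y: float      # baseline position (PDF origin is bottom-left)
    width: float  # advance width of the run
    text: str


def group_into_lines(runs, y_tolerance=2.0, gap_threshold=1.0):
    """Rebuild reading-order lines from positioned runs (simple heuristic).

    Runs whose baselines are within y_tolerance of the first run on a line
    are treated as the same line; a horizontal gap wider than gap_threshold
    is treated as a word space. Both thresholds are assumed values.
    """
    lines = []  # list of (baseline_y, [runs on that line])
    # Larger y is higher on the page, so sort descending to read top-down.
    for run in sorted(runs, key=lambda r: -r.y):
        if lines and abs(lines[-1][0] - run.y) <= y_tolerance:
            lines[-1][1].append(run)
        else:
            lines.append((run.y, [run]))

    result = []
    for _, line_runs in lines:
        line_runs.sort(key=lambda r: r.x)  # left to right within a line
        parts = [line_runs[0].text]
        for prev, cur in zip(line_runs, line_runs[1:]):
            gap = cur.x - (prev.x + prev.width)
            parts.append((" " if gap > gap_threshold else "") + cur.text)
        result.append("".join(parts))
    return result


# Runs arrive in drawing order, which need not match reading order.
runs = [
    TextRun(130.0, 700.0, 30.0, "world"),
    TextRun(72.0, 686.0, 40.0, "Second"),
    TextRun(72.0, 700.0, 28.0, "Hel"),
    TextRun(100.0, 700.0, 25.0, "lo"),
    TextRun(116.0, 686.0, 22.0, "line"),
]
lines = group_into_lines(runs)
print(lines)
```

Even this toy case has to guess where words and lines begin and end. With two columns, the same baseline test would happily merge text from both columns into one line, which is the kind of structure that gets lost.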

replies(2): >>43974734 #>>43975239 #

lionkor ◴[13 May 25 16:32 UTC] No.43974734[source]▶

>>43974428 #

Correct me if I'm wrong, but pdf.js actually has a lot of methods to manipulate PDFs, no?

replies(1): >>43976996 #

1. rad_gruchalski ◴[13 May 25 19:55 UTC] No.43976996[source]▶

>>43974734 #

Yes, pdf.js can do that: https://github.com/mozilla/pdf.js/blob/master/web/viewer.htm....

The purpose of my original comment was simply to say: there's an existing implementation, so if you're building a PDF viewer/editor and need inspiration, have a look. One of the reasons Mozilla is doing this is to be a reference implementation. I'm not sure why people are upset with this, though I could have explained it better.
