PDF to Text, a challenging problem

1. rad_gruchalski ◴[13 May 25 15:31 UTC] No.43974057[source]▶

So many of these problems have been solved by Mozilla's pdf.js together with its viewer implementation: https://mozilla.github.io/pdf.js/.

replies(3): >>43974240 #>>43974428 #>>43975184 #

2. zzleeper ◴[13 May 25 15:47 UTC] No.43974240[source]▶

>>43974057 (TP) #

Any sense of how PDF.js compares against other tools such as pdfminer?

replies(2): >>43975258 #>>43977036 #

3. egnehots ◴[13 May 25 16:05 UTC] No.43974428[source]▶

>>43974057 (TP) #

I don't think so. pdf.js is able to render a PDF's content.

That is different from extracting "text". Text in a PDF can be encoded in many ways: in an actual image, in shapes (think segments, quadratic Bézier curves...), or in an XML format (really easy to process).

PDF viewers render text the way a printer would, processing drawing commands that end up as pixels on the screen.

But paragraphs, text layout, columns, and tables are often lost in the process, even though you can see them. So close, yet so far. That is why AI is quite strong at this task.
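To make the "so close, yet so far" point concrete, here is a minimal sketch of what an extractor is left with once the page has been reduced to positioned text runs. It is not pdf.js or any real library's API: `Fragment`, `group_into_lines`, and the `y_tolerance` value are hypothetical names and numbers chosen for illustration. It guesses lines by bucketing runs that share roughly the same baseline and then sorting each bucket left to right.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A run of text at a page position (hypothetical type, for illustration)."""
    x: float  # left edge
    y: float  # baseline, measured from the top of the page in this sketch
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Guess reading order: bucket fragments by baseline, then sort left to right."""
    lines = []  # each entry: [y of the first fragment in the line, [fragments...]]
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Fragments arrive in drawing order, which need not be reading order.
fragments = [
    Fragment(200, 100.5, "world"),
    Fragment(50, 120, "Second"),
    Fragment(50, 100, "Hello"),
    Fragment(120, 120.4, "line"),
]
print(group_into_lines(fragments))  # ['Hello world', 'Second line']
```

A heuristic like this already breaks on a two-column page, where it would glue the left and right columns into one line. That is exactly the kind of layout information that is visible on screen but missing from the extracted text.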

replies(2): >>43974734 #>>43975239 #

4. lionkor ◴[13 May 25 16:32 UTC] No.43974734[source]▶

>>43974428 #

Correct me if I'm wrong, but pdf.js actually has a lot of methods to manipulate PDFs, no?

replies(1): >>43976996 #

5. iAMkenough ◴[13 May 25 17:05 UTC] No.43975184[source]▶

>>43974057 (TP) #

A good PDF reader makes the problems easier to deal with, but does not solve the underlying issue.

The PDF itself is still flawed, even if pdf.js interprets it perfectly, which is still a problem for non-pdf.js viewers and tasks where "viewing" isn't the primary goal.

replies(1): >>43976494 #

6. rad_gruchalski ◴[13 May 25 17:11 UTC] No.43975239[source]▶

>>43974428 #

You are wrong. pdf.js can extract text and has all the facilities required to render and extract formatting. The latest version can also edit PDF files. It’s basically the same engine as the Firefox PDF viewer, which also has a document outline, search, linking, print preview, scaling, a scripting sandbox… It does not simply "render" a file.

Regarding tables, this here https://www.npmjs.com/package/pdf-table-extractor does a very good job at table interpretation and works on top of pdf.js.

I also didn’t say what works better or worse, nor did I go into whether PDF is good or bad.

I simply said that a ton of problems have been covered by

7. rad_gruchalski ◴[13 May 25 17:12 UTC] No.43975258[source]▶

>>43974240 #

I don’t know. I use pdf.js for everything PDF.

8. rad_gruchalski ◴[13 May 25 19:10 UTC] No.43976494[source]▶

>>43975184 #

Yeah. What I’m saying is that pdf.js seems to have some of these solved. All I’m suggesting is to have a look at it. I get that, for some, PDF is a broken format.

9. rad_gruchalski ◴[13 May 25 19:55 UTC] No.43976996{3}[source]▶

>>43974734 #

Yes, pdf.js can do that: https://github.com/mozilla/pdf.js/blob/master/web/viewer.htm....

The purpose of my original comment was simply to say: there’s an existing implementation, so if you’re building a PDF file viewer/editor and you need inspiration, have a look. One of the reasons why Mozilla is doing this is to be a reference implementation. I’m not sure why people are upset with this. Though, I could have explained it better.

10. favorited ◴[13 May 25 19:59 UTC] No.43977036[source]▶

>>43974240 #

I did some very broad testing of several PDF text extraction tools recently, and PDF.js was one of the slowest.

My use-case was specifically testing their performance as command-line tools, so that will skew the results to an extent. For example, PDFBox was very slow because you're paying the JVM startup cost with each invocation.
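For anyone wanting to reproduce this kind of comparison, here is a rough sketch of timing a tool as a command-line invocation, where every run pays the full process startup cost. It is not the harness used for the testing described here: `time_command` and the `runs` parameter are hypothetical, and `sample.pdf` is a placeholder path.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Return the best wall-clock time (seconds) over `runs` fresh process launches."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Each call starts a new process, so startup cost is included every time.
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in command so the sketch runs anywhere: an interpreter that does nothing.
# This measures pure startup overhead, the same kind of fixed cost a JVM adds.
startup = time_command([sys.executable, "-c", "pass"])
print(f"startup overhead: {startup:.3f}s")

# With Poppler installed, a real measurement might look like this
# ("sample.pdf" is a placeholder; "-" sends the extracted text to stdout):
# time_command(["pdftotext", "sample.pdf", "-"])
```

Timing a do-nothing run first gives a baseline to subtract, which helps separate a tool's fixed startup cost from the time it actually spends extracting text.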

Poppler's pdftotext utility and pdfminer.six were generally the fastest. Both produced serviceable plain-text versions of the PDFs, with minor differences in where they placed paragraph breaks.

I also wrote a small program which extracted text using Chrome's PDFium, which also performed well, but building that project can be a nightmare unless you're Google. IBM's Docling project, which uses ML models, produced by far the best formatting, preserving much of the document's original structure – but it was, of course, enormously slower and more energy-hungry.

Disclaimer: I was testing specific PDF files that are representative of the kind of documents my software produces.