PDF to Text, a challenging problem

I did some very broad testing of several PDF text extraction tools recently, and PDF.js was one of the slowest.

My use-case was specifically testing their performance as command-line tools, so that will skew the results to an extent. For example, PDFBox was very slow because you're paying the JVM startup cost with each invocation.
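
For anyone wanting to reproduce this kind of per-invocation measurement, here is a minimal sketch in Python. It is not the harness I used; the function names `time_command` and `summarize` are made up for illustration. It starts a fresh process for every run, so process startup (such as the JVM's) is counted in each timing.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run `cmd` as a fresh process `runs` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the tool's output so terminal I/O does not affect the timing.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Example (assumes Poppler is installed and a hypothetical sample.pdf exists):
#   summarize(time_command(["pdftotext", "sample.pdf", "-"]))
# A trivial command gives a rough baseline for process startup alone:
#   summarize(time_command([sys.executable, "-c", "pass"]))
```

Comparing each tool against a trivial baseline command gives a rough idea of how much of the measured time is startup rather than extraction.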

Poppler's pdftotext utility and pdfminer.six were generally the fastest. Both produced serviceable plain-text versions of the PDFs, with minor differences in where they placed paragraph breaks.

I also wrote a small program that extracted text using Chrome's PDFium. It performed well too, but building that project can be a nightmare unless you're Google. IBM's Docling project, which uses ML models, produced by far the best formatting, preserving much of the document's original structure, but it was, of course, far slower and more energy-hungry.

Disclaimer: I was testing specific PDF files that are representative of the kind of documents my software produces.