PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 2 comments | 13 May 25 15:01 UTC


bartread ◴[13 May 25 15:38 UTC] No.43974140[source]▶

Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where a table is simply a loose assemblage of graphical and text elements that is only discernible as a table once rendered, because the elements happen to be positioned that way.
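
To make that concrete, here is a rough sketch of what such a "table" can look like inside a page's content stream. The stream below is hand-written for illustration, not taken from a real file, and `positioned_text` is a hypothetical helper. Real content streams are usually compressed and use many more operators than this toy regex handles.

```python
import re

# Hand-written, illustrative content stream: four independent text blocks.
# Each BT ... ET block picks a font (Tf), moves to an x/y position (Td),
# and shows a string (Tj). Nothing here says "table", "row", or "column".
CONTENT_STREAM = """
BT /F1 10 Tf 72 700 Td (Item) Tj ET
BT /F1 10 Tf 200 700 Td (Qty) Tj ET
BT /F1 10 Tf 72 685 Td (Widget) Tj ET
BT /F1 10 Tf 200 685 Td (4) Tj ET
"""

# Matches only the exact operator sequence used in the toy stream above.
TEXT_BLOCK = re.compile(
    r"BT\s+/\w+\s+[\d.]+\s+Tf\s+([\d.]+)\s+([\d.]+)\s+Td\s+\((.*?)\)\s+Tj\s+ET"
)


def positioned_text(stream):
    """Return (x, y, text) for every text block in the toy stream."""
    return [(float(x), float(y), s) for x, y, s in TEXT_BLOCK.findall(stream)]


for x, y, text in positioned_text(CONTENT_STREAM):
    print(x, y, text)
```

The only thing tying "Widget" to "4" is that they share a y position, which is why the table structure has to be inferred from coordinates.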

I've actually had decent luck extracting tabular data from PDFs by converting the PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinate of the HTML element for each value within the table to work out columns and extract the values for each row.

It's kind of groaty but it seems reliable for what I need. Certainly much more so than going via formatted plaintext, which has issues with inconsistent spacing and with newlines being inserted into the middle of rows.
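
A minimal sketch of the column-by-x-coordinate idea, not the actual code described above. It assumes the converted output has already been parsed into `(left, top, text)` fragments with `top` increasing down the page. The `extract_table` name, the `y_tol` parameter, and the sample data are all made up for illustration.

```python
def extract_table(fragments, header, y_tol=2.0):
    """Rebuild table rows from positioned text fragments.

    fragments: list of (left, top, text) tuples (assumed input shape).
    header:    expected column names, in order.
    y_tol:     fragments whose tops differ by at most this much share a row.
    """
    # Step 1: find the header cells and remember each column's x position.
    col_x = {}
    header_top = None
    for name in header:
        hit = next((f for f in fragments if f[2].strip() == name), None)
        if hit is None:
            raise ValueError(f"header cell not found: {name!r}")
        col_x[name] = hit[0]
        header_top = hit[1] if header_top is None else max(header_top, hit[1])

    # Step 2: keep only fragments below the header, in reading order.
    body = sorted(
        (f for f in fragments if f[1] > header_top + y_tol),
        key=lambda f: (f[1], f[0]),
    )

    # Step 3: start a new row when top jumps; assign each fragment to the
    # column whose header x is nearest to the fragment's left edge.
    rows = []
    current_top = None
    for left, top, text in body:
        if current_top is None or top - current_top > y_tol:
            rows.append({name: "" for name in header})
            current_top = top
        nearest = min(header, key=lambda name: abs(col_x[name] - left))
        cell = rows[-1][nearest]
        rows[-1][nearest] = f"{cell} {text.strip()}".strip()
    return rows


# Made-up fragments, including a title above the table and slight jitter.
sample = [
    (72, 50, "Invoice 2025"),
    (72, 100, "Item"), (200, 100, "Qty"), (300, 100, "Price"),
    (72, 120, "Widget"), (201, 120, "4"), (299, 121, "9.50"),
    (72, 140, "Gadget"), (200, 140, "12"), (300, 140, "1.25"),
]
table = extract_table(sample, ["Item", "Qty", "Price"])
```

Matching on the left edge only works for left-aligned columns; right-aligned numbers would need the fragment's right edge or centre instead, and anything below the table that isn't part of it would need an explicit stop condition.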

replies(4): >>43974220 #>>43976252 #>>43976596 #>>43984384 #

j45 ◴[13 May 25 15:45 UTC] No.43974220[source]▶

>>43974140 #

PDFs are inherently a markup / XML format, and the standard is available to learn from.

It's possible to create the same PDF in many, many, many ways.

Some might lean towards exporting a layout containing text and graphics from a graphics suite.

Others might lean towards exporting text and graphics from a word processor, which is words first.

How the creating app models information often shapes how the PDF is output.

If you're looking for an off-the-shelf utility that is surprisingly decent at pulling structured data from PDFs, tools like Cisdem have already solved enough of it for local users. There are lots of tools like this out there; many promise structured data support, but it needs to match what you're up to.

replies(2): >>43974556 #>>43980868 #

1. jimjimjim ◴[14 May 25 04:38 UTC] No.43980868[source]▶

>>43974220 #

uh. There is very little XML and the spec is a thousand pages long.

replies(1): >>43996920 #

2. j45 ◴[15 May 25 16:54 UTC] No.43996920[source]▶

>>43980868 (TP) #

Clarified above - referring to the visual side of coding PDFs by hand.

https://medium.com/@jberkenbilt/the-structure-of-a-pdf-file-...
