←back to thread

357 points ingve | 1 comments | | HN request time: 0s | source
Show context
bartread ◴[] No.43974140[source]
Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where tables are simply an loose assemblage of graphical and text elements that, only when rendered, are easily discernible as a table because they're positioned in such a way that they render as a table.

I've actually had decent luck extracting tabular data from PDFS by converting the PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinate of the HTML elements for each value within the table to work out columns, and extract values for each rows.

It's kind of groaty but it seems reliable for what I need. Certainly much moreso than going via formatted plaintext, which has issues with inconsistent spacing, and the insertion of newlines into the middle of rows.

replies(4): >>43974220 #>>43976252 #>>43976596 #>>43984384 #
j45 ◴[] No.43974220[source]
PDFs inherently are a markup / xml format, the standard is available to learn from.

It's possible to create the same PDF in many, many, many ways.

Some might lean towards exporting a layout containing text and graphics from a graphics suite.

Others might lean towards exporting text and graphics from a word processor, which is words first.

The lens of how the creating app deals with information is often something that has input on how the PDF is output.

If you're looking for an off the shelf utility that is surprisingly decent at pulling structured data from PDFs, tools like cisdem have already solved enough of it for local users. Lots of tools like this out there, many do promise structured data support but it needs to match what you're up to.

replies(2): >>43974556 #>>43980868 #
layer8 ◴[] No.43974556[source]
> PDFs inherently are a markup / xml format

This is false. PDFs are an object graph containing imperative-style drawing instructions (among many other things). There’s a way to add structural information on top (akin to an HTML document structure), but that’s completely optional and only serves as auxiliary metadata, it’s not at the core of the PDF format.

replies(3): >>43975265 #>>43978252 #>>43978643 #
1. bartread ◴[] No.43978643[source]
> but that’s completely optional and only serves as auxiliary metadata, it’s not at the core of the PDF format.

This is what I kind of suspected but, as I said in my original comment, I'm not an expert and for the PDFs I'm reading I didn't need to delve further because that metadata simply isn't in there (although, boy do I wish it was) so I needed to use a different approach. As soon as I realised what I had was purely presentation I knew it was going to be a bit grim.