PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC | HN request time: 0.202s | source

Show context

bartread ◴[13 May 25 15:38 UTC] No.43974140[source]▶

Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where tables are simply an loose assemblage of graphical and text elements that, only when rendered, are easily discernible as a table because they're positioned in such a way that they render as a table.

I've actually had decent luck extracting tabular data from PDFS by converting the PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinate of the HTML elements for each value within the table to work out columns, and extract values for each rows.

It's kind of groaty but it seems reliable for what I need. Certainly much moreso than going via formatted plaintext, which has issues with inconsistent spacing, and the insertion of newlines into the middle of rows.

replies(4): >>43974220 #>>43976252 #>>43976596 #>>43984384 #

j45 ◴[13 May 25 15:45 UTC] No.43974220[source]▶

>>43974140 #

PDFs inherently are a markup / xml format, the standard is available to learn from.

It's possible to create the same PDF in many, many, many ways.

Some might lean towards exporting a layout containing text and graphics from a graphics suite.

Others might lean towards exporting text and graphics from a word processor, which is words first.

The lens of how the creating app deals with information is often something that has input on how the PDF is output.

If you're looking for an off the shelf utility that is surprisingly decent at pulling structured data from PDFs, tools like cisdem have already solved enough of it for local users. Lots of tools like this out there, many do promise structured data support but it needs to match what you're up to.

replies(2): >>43974556 #>>43980868 #

layer8 ◴[13 May 25 16:16 UTC] No.43974556[source]▶

>>43974220 #

> PDFs inherently are a markup / xml format

This is false. PDFs are an object graph containing imperative-style drawing instructions (among many other things). There’s a way to add structural information on top (akin to an HTML document structure), but that’s completely optional and only serves as auxiliary metadata, it’s not at the core of the PDF format.

replies(3): >>43975265 #>>43978252 #>>43978643 #

1. davidthewatson ◴[13 May 25 17:13 UTC] No.43975265[source]▶

>>43974556 #

Thanks for your comment.

Indeed. Therein lies the rub.

Why?

Because no matter the fact that I've spent several years of my latent career crawling and parsing and outputting PDF data, I see now that pointing my LLLM stack at a directory of *.pdf just makes the invisible encoding of the object graph visible. It's a skeptical science.

The key transclusion may be to move from imperative to declarative tools or conditional to probabilistic tools, as many areas have in the last couple decades.

I've been following John Sterling's ocaml work for a while on related topics and the ideas floating around have been a good influence on me in forests and their forester which I found resonant given my own experience:

https://www.jonmsterling.com/index/index.xml

https://github.com/jonsterling/forest

I was gonna email john and ask whether it's still being worked on as I hope so, but I brought it up this morning as a way out of the noise that imperative programming PDF has been for a decade or more where turtles all the way down to the low-level root cause libraries mean that the high level imperative languages often display the exact same bugs despite significant differences as to what's being intended in the small on top of the stack vs the large on the bottom of the stack. It would help if "fitness for a particular purpose" decisions were thoughtful as to publishing and distribution but as the CFO likes to say, "Dave, that ship has already sailed." Sigh.

¯\_(ツ)_/¯

↑