Popular/hot comments

(www.marginalia.nu)

bartread ◴[13 May 25 15:38 UTC] No.43974140[source]▶

Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where tables are simply a loose assemblage of graphical and text elements that, only when rendered, are easily discernible as a table because they're positioned in such a way that they render as a table.

I've actually had decent luck extracting tabular data from PDFs by converting the PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinate of the HTML elements for each value within the table to work out columns, and extract values for each row.

It's kind of grotty but it seems reliable for what I need. Certainly much more so than going via formatted plaintext, which has issues with inconsistent spacing, and the insertion of newlines into the middle of rows.
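A minimal sketch of the x-coordinate approach described above, not the actual code behind it. It assumes the PDF has already been converted with Poppler's `pdftohtml -xml`, which emits `<text>` elements carrying `top` and `left` attributes. The `SAMPLE` document, the `extract_table` name, and the `tolerance` parameter are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like `pdftohtml -xml` output; real files carry
# font specs and more attributes.
SAMPLE = """<pdf2xml>
  <page number="1">
    <text top="100" left="50" width="40" height="12" font="0">Item</text>
    <text top="100" left="200" width="30" height="12" font="0">Qty</text>
    <text top="100" left="300" width="40" height="12" font="0">Price</text>
    <text top="120" left="50" width="50" height="12" font="0">Widget</text>
    <text top="120" left="203" width="10" height="12" font="0">2</text>
    <text top="120" left="301" width="30" height="12" font="0">9.50</text>
    <text top="140" left="50" width="50" height="12" font="0">Gadget</text>
    <text top="140" left="301" width="30" height="12" font="0">4.00</text>
  </page>
</pdf2xml>"""


def extract_table(xml_text, header, tolerance=10):
    """Return one dict per row, assigning each cell to the nearest header column."""
    root = ET.fromstring(xml_text)
    cells = sorted(
        (int(t.get("top")), int(t.get("left")), "".join(t.itertext()).strip())
        for t in root.iter("text")
    )

    # Locate the header: the first (topmost) occurrence of each expected label.
    header_cells = {}
    for top, left, text in cells:
        if text in header:
            header_cells.setdefault(text, (top, left))
    if set(header_cells) != set(header):
        raise ValueError("expected table header not found")

    header_top = max(top for top, _ in header_cells.values())
    column_x = {name: header_cells[name][1] for name in header}

    rows = {}
    for top, left, text in cells:
        if top <= header_top:
            continue  # skip the header row and anything above it
        # Pick the column whose header x-coordinate is closest to this cell.
        name = min(column_x, key=lambda n: abs(column_x[n] - left))
        if abs(column_x[name] - left) > tolerance:
            continue  # too far from any column; probably not part of the table
        rows.setdefault(top, dict.fromkeys(header, ""))[name] = text
    return [rows[top] for top in sorted(rows)]


for row in extract_table(SAMPLE, ["Item", "Qty", "Price"]):
    print(row)
```

The sketch assumes left-aligned columns and one `top` value per row. Right-aligned numbers or wrapped cells would need a different matching rule, for example comparing `left + width` or clustering nearby `top` values.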

replies(4): >>43974220 #>>43976252 #>>43976596 #>>43984384 #

1. j45 ◴[13 May 25 15:45 UTC] No.43974220[source]▶

>>43974140 #

PDFs inherently are a markup / xml format, the standard is available to learn from.

It's possible to create the same PDF in many, many, many ways.

Some might lean towards exporting a layout containing text and graphics from a graphics suite.

Others might lean towards exporting text and graphics from a word processor, which is words first.

The lens of how the creating app deals with information is often something that has input on how the PDF is output.

If you're looking for an off-the-shelf utility that is surprisingly decent at pulling structured data from PDFs, tools like Cisdem have already solved enough of it for local users. There are lots of tools like this out there. Many promise structured data support, but it needs to match what you're up to.

replies(2): >>43974556 #>>43980868 #

2. layer8 ◴[13 May 25 16:16 UTC] No.43974556[source]▶

>>43974220 (TP) #

> PDFs inherently are a markup / xml format

This is false. PDFs are an object graph containing imperative-style drawing instructions (among many other things). There’s a way to add structural information on top (akin to an HTML document structure), but that’s completely optional and only serves as auxiliary metadata, it’s not at the core of the PDF format.
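To illustrate the "imperative-style drawing instructions" point, here is a rough sketch. The content stream below is a simplified, hypothetical example: real streams are usually compressed, use font-specific encodings, and contain many more operators. The regex only understands this exact pattern and is not a PDF parser.

```python
import re

# Hypothetical, uncompressed page content stream. `Tm` sets the text matrix
# (here just a position) and `Tj` shows a string at that position.
CONTENT_STREAM = """\
BT
/F1 12 Tf
1 0 0 1 72 700 Tm (Item) Tj
1 0 0 1 200 700 Tm (Price) Tj
1 0 0 1 72 680 Tm (Widget) Tj
1 0 0 1 200 680 Tm (9.50) Tj
ET
"""

# Matches only "1 0 0 1 x y Tm (text) Tj" with integer coordinates.
SHOW_TEXT = re.compile(r"1 0 0 1 (\d+) (\d+) Tm \(([^)]*)\) Tj")


def placed_text(stream):
    """Return (x, y, text) for each 'move here, draw this string' pair."""
    return [(int(x), int(y), s) for x, y, s in SHOW_TEXT.findall(stream)]


for x, y, text in placed_text(CONTENT_STREAM):
    print(x, y, text)
```

Nothing in the stream says "row", "column", or "table". Those only exist as coincidences of position, which is why extraction ends up reasoning about coordinates.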

replies(3): >>43975265 #>>43978252 #>>43978643 #

3. davidthewatson ◴[13 May 25 17:13 UTC] No.43975265[source]▶

>>43974556 #

Thanks for your comment.

Indeed. Therein lies the rub.

Why?

Because no matter the fact that I've spent several years of my latent career crawling and parsing and outputting PDF data, I see now that pointing my LLM stack at a directory of *.pdf just makes the invisible encoding of the object graph visible. It's a skeptical science.

The key transclusion may be to move from imperative to declarative tools or conditional to probabilistic tools, as many areas have in the last couple decades.

I've been following Jon Sterling's OCaml work for a while on related topics and the ideas floating around have been a good influence on me in forests and their forester which I found resonant given my own experience:

https://www.jonmsterling.com/index/index.xml

https://github.com/jonsterling/forest

I was gonna email Jon and ask whether it's still being worked on (I hope so), but I brought it up this morning as a way out of the noise that imperative PDF programming has been for a decade or more. It's turtles all the way down to the low-level root-cause libraries, which means that the high-level imperative languages often display the exact same bugs despite significant differences in what's intended in the small at the top of the stack vs. in the large at the bottom. It would help if "fitness for a particular purpose" decisions were thoughtful as to publishing and distribution but, as the CFO likes to say, "Dave, that ship has already sailed." Sigh.

¯\_(ツ)_/¯

4. j45 ◴[13 May 25 22:01 UTC] No.43978252[source]▶

>>43974556 #

I appreciate the clarification. Should have been more precise with my terminology.

That being said, I think I'm talking about the forest of PDFs.

When I said PDFs have a "markup-like structure," I was talking from my experience manually writing PDFs from scratch using Adobe's spec.

PDFs definitely have a structured, hierarchical format with nested elements that looks a lot like markup languages conceptually.

The objects have a structure comparable to DOM-like structures - there's clear parent-child relationships just like in markup languages. Working with tags like "<<" and ">>" feels similar to markup tags when hand coding them.

This is an article that highlights what I have seen (much cleaner PDF code): "The Structure of a PDF File" (https://medium.com/@jberkenbilt/the-structure-of-a-pdf-file-...) which says:

"There are several types of objects. If you are familiar with JSON, YAML, or the object model in any reasonably modern programming language, this will seem very familiar to you... A PDF object may have one of the following types: String, Number, Boolean, Null, Name, Array, Dictionary..."

This structure with dictionaries in "<<" and ">>" and arrays in brackets really gave me markup vibes when coding to the spec (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...).
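As a rough illustration of the JSON comparison in that quote, here is a toy reader for a small subset of PDF object syntax: dictionaries in "<<" and ">>", arrays in brackets, names, numbers, booleans, and null. It is only a sketch. The `parse_pdf_object` name is made up, and strings, streams, and indirect references like `3 0 R` are deliberately not handled.

```python
import re

# Tokens for a small subset of PDF object syntax.
TOKEN = re.compile(r"<<|>>|\[|\]|/[^\s/\[\]<>()]+|[+-]?\d*\.?\d+|true|false|null")


def parse_pdf_object(text):
    """Parse one PDF object (subset only) into Python dicts, lists, and scalars."""
    tokens = TOKEN.findall(text)
    pos = 0

    def value():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "<<":  # dictionary: alternating name keys and values
            result = {}
            while tokens[pos] != ">>":
                key = value()
                result[key] = value()
            pos += 1  # consume ">>"
            return result
        if tok == "[":  # array
            items = []
            while tokens[pos] != "]":
                items.append(value())
            pos += 1  # consume "]"
            return items
        if tok.startswith("/"):  # name object, kept with its leading slash
            return tok
        if tok in ("true", "false"):
            return tok == "true"
        if tok == "null":
            return None
        return float(tok) if "." in tok else int(tok)

    return value()


page = parse_pdf_object("<< /Type /Page /MediaBox [0 0 612 792] /Rotate 0 >>")
print(page)
```

The result is an ordinary nested dict and list, which is the sense in which the object syntax resembles JSON. Note that this describes the container syntax only, not the drawing instructions inside page content streams.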

While PDFs are an object graph with drawing instructions like you said, the structure itself looks a lot like markup formats.

Might be just a difference in choosing to focus on the forest vs the trees.

That hierarchical structure is why different PDF creation methods can make such varied document structures, which is exactly why text extraction is so tricky.

Learning to hand-code PDFs in many ways lets you learn to read and unravel them a little differently, maybe even a bit more easily.

replies(1): >>43979148 #

5. bartread ◴[13 May 25 22:49 UTC] No.43978643[source]▶

>>43974556 #

> but that’s completely optional and only serves as auxiliary metadata, it’s not at the core of the PDF format.

This is what I kind of suspected but, as I said in my original comment, I'm not an expert and for the PDFs I'm reading I didn't need to delve further because that metadata simply isn't in there (although, boy do I wish it was) so I needed to use a different approach. As soon as I realised what I had was purely presentation I knew it was going to be a bit grim.

6. layer8 ◴[13 May 25 23:48 UTC] No.43979148{3}[source]▶

>>43978252 #

Markup is only indirectly related to hierarchical structure. “Markup” means that there is text that is being “marked up” with additional attributes (styling, structure information, metadata, …). This is how HTML and XML work, and also languages like TeX, Troff, and Markdown. For example, in the text “this is some text”, you can mark up the word “some” as being emphasized, as in “this is <em>some</em> text”.

The general principle is that the base content is plain text, which is augmented with markup information, which may or may not have hierarchical aspects. You can simply strip away the markup again and recover just the text. That’s not at all how PDF works, however.
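A small sketch of the "strip away the markup and recover just the text" property, using Python's standard-library `html.parser`. The `TextOnly` class and `strip_markup` function are names made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the character data between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(marked_up):
    parser = TextOnly()
    parser.feed(marked_up)
    parser.close()
    return "".join(parser.parts)


print(strip_markup("this is <em>some</em> text"))
```

This prints the original sentence, "this is some text". There is no equivalent one-step operation for a PDF, because the text is not a base layer there; it is scattered across drawing instructions.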

You cite a comparison to JSON and YAML. Those are not markup languages (despite what YAML originally was an abbreviation for, see [0]). (HTML also isn’t DOM.)

[0] https://stackoverflow.com/a/18928199

replies(1): >>43988651 #

7. jimjimjim ◴[14 May 25 04:38 UTC] No.43980868[source]▶

>>43974220 (TP) #

uh. There is very little XML and the spec is a thousand pages long.

replies(1): >>43996920 #

8. j45 ◴[14 May 25 20:07 UTC] No.43988651{4}[source]▶

>>43979148 #

I was quoting the article there about JSON/YAML, not making that claim myself.

Did you take a look at the article I linked? It shows visual examples of hand-coded PDFs that demonstrate the structural similarities I am talking about.

Thanks for the clarification on terminology. I could have been clearer and more precise. I referred to "DOM-like structures" as an analogy for the hierarchical nature of PDF objects, not to claim HTML is DOM.

My core point wasn't about the technical definition of markup languages, but about the structural similarity between PDF's object model and hierarchical formats.

When coding a PDF document by hand, you work with nested structures using delimiters like "<<" and ">>" that create hierarchical relationships between objects - which has practical parallels to working with nested elements in other formats.

The forest vs. trees metaphor was to acknowledge that while PDFs aren't primarily markup formats (the trees), they do share structural characteristics with hierarchical formats (the forest) based on my hands-on experience with manual PDF creation.

Hope that helps clarify things a bit.

9. j45 ◴[15 May 25 16:54 UTC] No.43996920[source]▶

>>43980868 #

Clarified above - referring to the visual side of coding PDFs by hand.

https://medium.com/@jberkenbilt/the-structure-of-a-pdf-file-...

PDF to Text, a challenging problem