I appreciate the clarification. I should have been more precise with my terminology.
That said, I think I was describing the forest of PDFs rather than the trees.
When I said PDFs have a "markup-like structure," I was speaking from my experience of writing PDFs by hand, from scratch, against Adobe's spec.
PDFs definitely have a structured, hierarchical format with nested elements that conceptually resembles a markup language.
The objects form a structure comparable to a DOM: there are clear parent-child relationships, just like in markup languages. Working with delimiters like "<<" and ">>" feels similar to working with markup tags when hand-coding them.
Here is an article that illustrates what I have seen (with much cleaner PDF code than mine): "The Structure of a PDF File" (https://medium.com/@jberkenbilt/the-structure-of-a-pdf-file-...), which says:
"There are several types of objects. If you are familiar with JSON, YAML, or the object model in any reasonably modern programming language, this will seem very familiar to you... A PDF object may have one of the following types: String, Number, Boolean, Null, Name, Array, Dictionary..."
This structure, with dictionaries in "<<" and ">>" and arrays in square brackets, really gave me markup vibes when coding to the spec (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...).
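To make the nesting concrete, here is a rough Python sketch that reads a hand-written page dictionary into nested Python objects. It is only an illustration under heavy assumptions: `tokenize`, `parse`, and the sample object are names I made up, and it handles only a small subset of the syntax (no streams, hex strings, comments, or string escapes), so it is nowhere near a real PDF parser.

```python
import re

# Tokens for a small subset of PDF object syntax (illustrative assumption:
# no hex strings, streams, comments, or escaped/nested parentheses).
TOKEN = re.compile(r"<<|>>|\[|\]|/[^\s<>\[\]()/]*|\([^()]*\)|[^\s<>\[\]()/]+")


def tokenize(text):
    """Split PDF object syntax into a flat list of tokens."""
    return TOKEN.findall(text)


def parse(tokens, pos=0):
    """Parse one object starting at tokens[pos]; return (value, next_pos)."""
    tok = tokens[pos]
    if tok == "<<":  # dictionary: alternating name keys and values
        result = {}
        pos += 1
        while tokens[pos] != ">>":
            key = tokens[pos]
            result[key], pos = parse(tokens, pos + 1)
        return result, pos + 1
    if tok == "[":  # array: values until the closing bracket
        items = []
        pos += 1
        while tokens[pos] != "]":
            value, pos = parse(tokens, pos)
            items.append(value)
        return items, pos + 1
    if tok.startswith("/"):  # name, kept with its leading slash
        return tok, pos + 1
    if tok.startswith("("):  # literal string, parentheses stripped
        return tok[1:-1], pos + 1
    if tok in ("true", "false"):
        return tok == "true", pos + 1
    if tok == "null":
        return None, pos + 1
    # Indirect reference such as "2 0 R" (object number, generation, R).
    if (tok.isdigit() and pos + 2 < len(tokens)
            and tokens[pos + 1].isdigit() and tokens[pos + 2] == "R"):
        return ("ref", int(tok), int(tokens[pos + 1])), pos + 3
    return (float(tok) if "." in tok else int(tok)), pos + 1


# Hypothetical page object, written the way I would hand-code one.
sample = """<< /Type /Page /MediaBox [0 0 612 792] /Parent 2 0 R
   /Resources << /Font << /F1 5 0 R >> >> >>"""

page, _ = parse(tokenize(sample))
print(page["/MediaBox"])  # prints [0, 0, 612, 792]
```

The result is a plain nested dict, which is the parent-child shape I had in mind: `/Resources` holds `/Font`, which holds `/F1`, which points at another object by reference.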
While PDFs are an object graph with drawing instructions, as you said, the structure itself looks a lot like markup formats.
It might just be a difference between focusing on the forest and focusing on the trees.
That hierarchical structure is why different PDF creation methods can produce such varied document structures, and that variety is exactly why text extraction is so tricky.
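As a small illustration of that variety, here is a sketch with two made-up content streams that would draw the same visible line of text. One uses the `Tj` operator with a single string; the other uses `TJ` with an array of fragments and position adjustments, as a layout-oriented generator might emit. The streams and the `naive_strings` helper are my own assumptions for the example, not output from any particular tool.

```python
import re

# Two hypothetical content streams that would render the same visible text.
stream_a = b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET"
stream_b = b"BT /F1 12 Tf 72 720 Td [(Hel) -20 (lo) -250 (W) 15 (orld)] TJ ET"


def naive_strings(stream):
    """Return every literal string in the stream, in file order.

    Deliberately naive: it ignores operators, fonts, encodings, and
    positioning, which is where real extraction gets hard.
    """
    return [m.decode("latin-1") for m in re.findall(rb"\(([^()]*)\)", stream)]


print(naive_strings(stream_a))  # prints ['Hello World']
print(naive_strings(stream_b))  # prints ['Hel', 'lo', 'W', 'orld']
```

In the second stream there is no space character at all. The gap between the words exists only as the `-250` adjustment, so an extractor has to infer the space from geometry instead of reading it from the text.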
In many ways, learning to hand-code PDFs lets you read and unravel them a little differently, maybe even a bit more easily.