
357 points by ingve | 11 comments
1. svat ◴[] No.43974326[source]
One thing I wish someone would write is something like the browser's developer tools ("inspect elements") for PDF — it would be great to be able to "view source" a PDF's content streams (the BT … ET operators that enclose text, each Tj operator for setting down text in the currently chosen font, etc), to see how every “pixel” of the PDF is being specified/generated. I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains.

There are a few tools that allow inspecting a PDF's contents (https://news.ycombinator.com/item?id=41379101) but they stop at the level of the PDF's objects, so entire content streams are single objects. For example, to use one of the PDFs mentioned in this post, the file https://bfi.uchicago.edu/wp-content/uploads/2022/06/BFI_WP_2... has, corresponding to page number 6 (PDF page 8), a content stream that starts like (some newlines added by me):

    0 g 0 G
    0 g 0 G
    BT
    /F19 10.9091 Tf 88.936 709.041 Td
    [(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
    -16.936 -21.922 Td
    [(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
    0 -21.923 Td
and it would be really cool to be able to see the above “source” and the rendered PDF side-by-side, hover over one to see the corresponding region of the other, etc., the way we can for an HTML page.
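As a sketch of what such an inspector's first stage could look like (my addition, not from the thread): a toy tokenizer that groups a content stream into (operands, operator) instructions. It handles only the simplified subset shown above (numbers, names, literal strings without escapes or nesting, and arrays), not the full PDF lexical rules:

```python
import re

# Minimal tokenizer for a *simplified* subset of PDF content-stream syntax.
# Real PDF strings allow escapes, nesting, and hex forms; this sketch ignores them.
TOKEN = re.compile(r"""
    (?P<num>[-+]?\d*\.?\d+)          # numeric operand
  | (?P<name>/[^\s\[\]()/]+)         # name operand, e.g. /F19
  | (?P<str>\([^()]*\))              # literal string, e.g. (Subsequen)
  | (?P<delim>[\[\]])                # array delimiters for TJ
  | (?P<op>[A-Za-z'"]{1,3})          # operator, e.g. BT, Tf, Td, TJ
""", re.VERBOSE)

def parse(stream: str):
    """Group tokens into (operands, operator) instructions."""
    instructions, operands = [], []
    in_array, array = False, []
    for m in TOKEN.finditer(stream):
        kind, text = m.lastgroup, m.group()
        if kind == "delim":
            if text == "[":
                in_array, array = True, []
            else:
                in_array = False
                operands.append(array)
        elif kind == "num":
            (array if in_array else operands).append(float(text))
        elif kind in ("name", "str"):
            (array if in_array else operands).append(text)
        else:  # an operator ends the current instruction
            instructions.append((operands, text))
            operands = []
    return instructions

snippet = "BT /F19 10.9091 Tf 88.936 709.041 Td [(Subsequen)28(t)-374(to)]TJ ET"
for operands, op in parse(snippet):
    print(op, operands)
```

Hovering would then just mean tracking the text-matrix state (Td translations, Tf sizes) per instruction and mapping each to a page rectangle.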
replies(5): >>43974386 #>>43974502 #>>43975665 #>>43979271 #>>43985967 #
2. whenc ◴[] No.43974386[source]
Try with cpdf (disclaimer, wrote it):

  cpdf -output-json -output-json-parse-content-streams in.pdf -o out.json
Then you can play around with the JSON, and turn it back to PDF with

  cpdf -j out.json -o out.pdf
No live back-and-forth though.
replies(2): >>43974627 #>>43977333 #
3. dleeftink ◴[] No.43974502[source]
Have a look at this notebook[0], not exactly what you're looking for but does provide a 'live' inspector of the various drawing operations contained in a PDF.

[0]: https://observablehq.com/@player1537/pdf-utilities

replies(1): >>43974717 #
4. svat ◴[] No.43974627[source]
The live back-and-forth is the main point of what I'm asking for. I tried your cpdf (thanks for the mention; I'll add it to my list), but it doesn't help either: all it does is, somewhere 9000-odd lines into the JSON file, turn the part of the content stream corresponding to what I mentioned in the earlier comment into:

        [
          [ { "F": 0.0 }, "g" ],
          [ { "F": 0.0 }, "G" ],
          [ { "F": 0.0 }, "g" ],
          [ { "F": 0.0 }, "G" ],
          [ "BT" ],
          [ "/F19", { "F": 10.9091 }, "Tf" ],
          [ { "F": 88.93600000000001 }, { "F": 709.0410000000001 }, "Td" ],
          [
            [
              "Subsequen",
              { "F": 28.0 },
              "t",
              { "F": -374.0 },
              "to",
              { "F": -373.0 },
              "the",
              { "F": -373.0 },
              "p",
              { "F": -28.0 },
              "erio",
              { "F": -28.0 },
              "d",
              { "F": -373.0 },
              "analyzed",
              { "F": -373.0 },
              "in",
              { "F": -374.0 },
              "our",
              { "F": -373.0 },
              "study",
              { "F": 83.0 },
              ",",
              { "F": -383.0 },
              "Bridge's",
              { "F": -373.0 },
              "paren",
              { "F": 27.0 },
              "t",
              { "F": -373.0 },
              "compan",
              { "F": 28.0 },
              "y",
              { "F": -373.0 },
              "Ne",
              { "F": -1.0 },
              "wGlob",
              { "F": -27.0 },
              "e",
              { "F": -374.0 },
              "reduced"
            ],
            "TJ"
          ],
          [ { "F": -16.936 }, { "F": -21.922 }, "Td" ],
This is just a more verbose restatement of what's in the PDF file; the real questions I'm asking are:

- How can a user get to this part, from viewing the PDF file? (Note that the PDF page objects are not necessarily a flat list; they are often nested at different levels of “kids”.)

- How can a user understand these instructions, and “see” how they correspond to what is visually displayed on the PDF file?
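As a sketch toward the second question (my addition, not from the thread): assuming the cpdf JSON shape shown above, where each instruction is an array ending in its operator string and numbers are wrapped as `{"F": n}`, the text of each TJ run can be recovered by concatenating the strings and treating large negative kerning adjustments as word spaces. The -180 cutoff is a heuristic (roughly half an em in thousandths), not anything from the PDF spec:

```python
import json

def text_of_tj(instruction):
    """Return the visible text of a TJ instruction, or None for other operators."""
    operands, operator = instruction[:-1], instruction[-1]
    if operator != "TJ":
        return None
    parts = []
    for item in operands[0]:
        if isinstance(item, str):
            parts.append(item)
        elif isinstance(item, dict) and item.get("F", 0) < -180:
            parts.append(" ")  # big negative adjustment ~ inter-word gap
    return "".join(parts)

# A miniature instruction list in the same shape as cpdf's output above.
sample = json.loads("""
[
  [ "BT" ],
  [ "/F19", { "F": 10.9091 }, "Tf" ],
  [ [ "Subsequen", { "F": 28.0 }, "t", { "F": -374.0 },
      "to", { "F": -373.0 }, "the" ], "TJ" ],
  [ "ET" ]
]
""")
lines = [t for t in map(text_of_tj, sample) if t is not None]
print(lines)  # prints ['Subsequent to the']
```

Linking each recovered run back to a page region would additionally require tracking the Td/Tm position state, which is exactly the part no current tool seems to surface.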

5. svat ◴[] No.43974717[source]
Thanks, but I was not able to figure out how to get any use out of the notebook above. In what sense is it a 'live' inspector? All it seems to do is to just decompose the PDF into separate “ops” and “args” arrays (neither of which is meaningful without the other), but it does not seem “live” in any sense — how can one find the ops (and args) corresponding to a region of the PDF page, or vice-versa?
replies(1): >>43974972 #
6. dleeftink ◴[] No.43974972{3}[source]
You can load up your own PDF and select a page up front, after which it will display the opcodes for that page. Operations are not structurally grouped, but decomposed into three aligned arrays, which can be grouped to your liking based on opcode or used as coordinates for intersection queries (e.g. combining the ops and args arrays).

The 'liveness' here is that you can derive multiple downstream cells (e.g. filters, groupings, drawing instructions) from the initial parsed PDF, which will update as you swap out the PDF file.
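A minimal sketch of that re-pairing (my addition): assuming pdf.js-style aligned arrays of opcodes and argument lists, zipping them back into instructions and grouping by opcode is a few lines. The opcode names here are illustrative, not taken from the notebook:

```python
# Two aligned arrays, one of opcodes and one of argument lists, as pdf.js's
# getOperatorList exposes them (names below are illustrative placeholders).
ops  = ["beginText", "setFont", "moveText", "showText", "endText"]
args = [[], ["F19", 10.9091], [88.936, 709.041], ["Subsequent to the"], []]

# Re-pair into (opcode, arguments) instructions, preserving stream order.
instructions = list(zip(ops, args))

# Group argument lists by opcode, e.g. to pull out all text-showing calls.
by_op = {}
for op, a in instructions:
    by_op.setdefault(op, []).append(a)

print(instructions[3])  # ('showText', ['Subsequent to the'])
```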

7. kccqzy ◴[] No.43975665[source]
When you use PDF.js from Mozilla to render a PDF file in DOM, I think you might actually get something pretty close. For example I suppose each Tj becomes a <span> and each TJ becomes a collection of <span>s. (I'm fairly certain it doesn't use <canvas>.) And I suppose it must be very faithful to the original document to make it work.
replies(1): >>43976347 #
8. chaps ◴[] No.43976347[source]
Indeed! I've used it to parse documents I've received through FOIA -- sometimes it's just easier to write BeautifulSoup code than to deal with PDF's oddities.
9. IIAOPSW ◴[] No.43977333[source]
This might actually be something very valuable to me.

I have a bunch of documents right now that are annual statutory and financial disclosures of a large institute, and each year's filing is organized just differently enough from the next to make cross-comparing them manually too tedious. I've been looking for a tool that could break out the content and let me reorder it so that the same section is on the same page in every report.

This might be it.

10. hnick ◴[] No.43979271[source]
I assume you mean open source or free, but just noting that Acrobat Pro was almost there when I last used it years ago. The problem was that it worked in reverse: you browsed the content tree rather than inspecting the page, but it did highlight the corresponding object on the page. Not down to the individual operator, though, just the object/stream.
11. drguthals ◴[] No.43985967[source]
"I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains."

Some combination of this is what we're building at Tensorlake (full disclosure: I work there), where you can "see" the PDF like a human and "understand" the contents, not just "read" the text, because the contents of PDFs are usually tables, images, text, formulas, and handwriting.

Being able to "understand what a PDF file contains" is the important part. So we parse the PDF and run multiple models to extract Markdown chunks/JSON, so that you can ingest the actual data into other applications (AI agents, LLMs, or frankly whatever you want).

https://tensorlake.ai