PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC | HN request time: 0.237s | source

Show context

svat ◴[13 May 25 15:56 UTC] No.43974326[source]▶

One thing I wish someone would write is something like the browser's developer tools ("inspect elements") for PDF — it would be great to be able to "view source" a PDF's content streams (the BT … ET operators that enclose text, each Tj operator for setting down text in the currently chosen font, etc), to see how every “pixel” of the PDF is being specified/generated. I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains.

There are a few tools that allow inspecting a PDF's contents (https://news.ycombinator.com/item?id=41379101) but they stop at the level of the PDF's objects, so entire content streams are single objects. For example, to use one of the PDFs mentioned in this post, the file https://bfi.uchicago.edu/wp-content/uploads/2022/06/BFI_WP_2... has, corresponding to page number 6 (PDF page 8), a content stream that starts like (some newlines added by me):

    0 g 0 G
    0 g 0 G
    BT
    /F19 10.9091 Tf 88.936 709.041 Td
    [(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
    -16.936 -21.922 Td
    [(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
    0 -21.923 Td

and it would be really cool to be able to see the above “source” and the rendered PDF side-by-side, hover over one to see the corresponding region of the other, etc, the way we can do for a HTML page.

replies(5): >>43974386 #>>43974502 #>>43975665 #>>43979271 #>>43985967 #

whenc ◴[13 May 25 16:01 UTC] No.43974386[source]▶

>>43974326 #

Try with cpdf (disclaimer, wrote it):

  cpdf -output-json -output-json-parse-content-streams in.pdf -o out.json

Then you can play around with the JSON, and turn it back to PDF with

  cpdf -j out.json -o out.pdf

No live back-and-forth though.

replies(2): >>43974627 #>>43977333 #

1. svat ◴[13 May 25 16:23 UTC] No.43974627[source]▶

>>43974386 #

The live back-and-forth is the main point of what I'm asking for — I tried your cpdf (thanks for the mention; will add it to my list) and it too doesn't help; all it does is, somewhere 9000-odd lines into the JSON file, turn the part of the content stream corresponding to what I mentioned in the earlier comment into:

        [
          [ { "F": 0.0 }, "g" ],
          [ { "F": 0.0 }, "G" ],
          [ { "F": 0.0 }, "g" ],
          [ { "F": 0.0 }, "G" ],
          [ "BT" ],
          [ "/F19", { "F": 10.9091 }, "Tf" ],
          [ { "F": 88.93600000000001 }, { "F": 709.0410000000001 }, "Td" ],
          [
            [
              "Subsequen",
              { "F": 28.0 },
              "t",
              { "F": -374.0 },
              "to",
              { "F": -373.0 },
              "the",
              { "F": -373.0 },
              "p",
              { "F": -28.0 },
              "erio",
              { "F": -28.0 },
              "d",
              { "F": -373.0 },
              "analyzed",
              { "F": -373.0 },
              "in",
              { "F": -374.0 },
              "our",
              { "F": -373.0 },
              "study",
              { "F": 83.0 },
              ",",
              { "F": -383.0 },
              "Bridge's",
              { "F": -373.0 },
              "paren",
              { "F": 27.0 },
              "t",
              { "F": -373.0 },
              "compan",
              { "F": 28.0 },
              "y",
              { "F": -373.0 },
              "Ne",
              { "F": -1.0 },
              "wGlob",
              { "F": -27.0 },
              "e",
              { "F": -374.0 },
              "reduced"
            ],
            "TJ"
          ],
          [ { "F": -16.936 }, { "F": -21.922 }, "Td" ],

This is just a more verbose restatement of what's in the PDF file; the real questions I'm asking are:

- How can a user get to this part, from viewing the PDF file? (Note that the PDF page objects are not necessarily a flat list; they are often nested at different levels of “kids”.)

- How can a user understand these instructions, and “see” how they correspond to what is visually displayed on the PDF file?

↑