PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC | HN request time: 0s | source

Show context

svat ◴[13 May 25 15:56 UTC] No.43974326[source]▶

One thing I wish someone would write is something like the browser's developer tools ("inspect elements") for PDF — it would be great to be able to "view source" a PDF's content streams (the BT … ET operators that enclose text, each Tj operator for setting down text in the currently chosen font, etc), to see how every “pixel” of the PDF is being specified/generated. I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains.

There are a few tools that allow inspecting a PDF's contents (https://news.ycombinator.com/item?id=41379101) but they stop at the level of the PDF's objects, so entire content streams are single objects. For example, to use one of the PDFs mentioned in this post, the file https://bfi.uchicago.edu/wp-content/uploads/2022/06/BFI_WP_2... has, corresponding to page number 6 (PDF page 8), a content stream that starts like (some newlines added by me):

    0 g 0 G
    0 g 0 G
    BT
    /F19 10.9091 Tf 88.936 709.041 Td
    [(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
    -16.936 -21.922 Td
    [(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
    0 -21.923 Td

and it would be really cool to be able to see the above “source” and the rendered PDF side-by-side, hover over one to see the corresponding region of the other, etc, the way we can do for a HTML page.

replies(5): >>43974386 #>>43974502 #>>43975665 #>>43979271 #>>43985967 #

dleeftink ◴[13 May 25 16:11 UTC] No.43974502[source]▶

>>43974326 #

Have a look at this notebook[0], not exactly what you're looking for but does provide a 'live' inspector of the various drawing operations contained in a PDF.

[0]: https://observablehq.com/@player1537/pdf-utilities

replies(1): >>43974717 #

svat ◴[13 May 25 16:31 UTC] No.43974717[source]▶

>>43974502 #

Thanks, but I was not able to figure out how to get any use out of the notebook above. In what sense is it a 'live' inspector? All it seems to do is to just decompose the PDF into separate “ops” and “args” arrays (neither of which is meaningful without the other), but it does not seem “live” in any sense — how can one find the ops (and args) corresponding to a region of the PDF page, or vice-versa?

replies(1): >>43974972 #

1. dleeftink ◴[13 May 25 16:50 UTC] No.43974972[source]▶

>>43974717 #

You can load up your own PDF and select a page up front after which it will display the opcodes for this page. Operations are not structurally grouped, but decomposed in three aligned arrays which can be grouped to your liking based on opcode or used as coordinates for intersection queries (e.g. combining the ops and args arrays).

The 'liveness' here is that you can derive multiple downstream cells (e.g. filters, groupings, drawing instructions) from the initial parsed PDF, which will update as you swap out the PDF file.

↑