PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC

90s_dev ◴[13 May 25 18:18 UTC] No.43975996[source]▶

Have any of you ever thought to yourself, this is new and interesting, and then vaguely remembered that you spent months or years becoming an expert at it earlier in life but entirely forgot it? And in fact large chunks of the very interesting things you've done just completely flew out of your mind long ago, to the point where you feel absolutely new at life, like you've accomplished relatively nothing, until something like this jars you out of that forgetfulness?

I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... Google tells me it was "tesseract" and that sounds familiar.

replies(10): >>43976023 #>>43976086 #>>43976796 #>>43977155 #>>43977580 #>>43977605 #>>43978171 #>>43979324 #>>43980118 #>>43981115 #

hallman76 ◴[14 May 25 02:32 UTC] No.43980118[source]▶

>>43975996 #

We will never get back the collective man-decades of time that have been burned by this format. When will the madness stop?

replies(3): >>43980258 #>>43980692 #>>43984513 #

theamk ◴[14 May 25 02:54 UTC] No.43980258[source]▶

>>43980118 #

When we get an alternative that can:

(1) be stored in a single file

(2) allow tables, images and anything else that can be shown on a piece of paper

(3) won't have animation, fold-out text, or anything that cannot be shown on a piece of paper

(4) won't require Javascript or access to external sites

That means never. We got lucky that we at least got PDF before "web designers" made (3) impossible and marketers made (4) impossible.

replies(7): >>43980448 #>>43980591 #>>43981173 #>>43981225 #>>43981712 #>>43982818 #>>43984377 #

protocolture ◴[14 May 25 03:47 UTC] No.43980591[source]▶

>>43980258 #

Behold a Bitmap.

But for real, that's a pretty easy set of hurdles. Really the barrier is the psychological fallacy that PDF's are immutable.

replies(1): >>43980708 #

theamk ◴[14 May 25 04:09 UTC] No.43980708[source]▶

>>43980591 #

Should have added "looks good on screen and on paper", "stores text compactly" and "multiple pages supported" :) And yes, that's a pretty easy set of hurdles. I wish we'd standardized on DjVu instead.

Re "PDF's are immutable." - that's not a psychological fallacy, that's a primary advantage of PDFs. If I wanted a mutable format, I'd take an odt (or rtf or a doc). An "output only" format allows one to use the very latest version of the editor app while still having the result work even in ancient readers, something very desirable in many contexts.

replies(4): >>43980842 #>>43982603 #>>43982758 #>>43986015 #

imtringued ◴[14 May 25 09:34 UTC] No.43982603[source]▶

>>43980708 #

PDFs are not really immutable. I use Okular all the time to write my "notes" (it's just text that you can place anywhere) on top of a PDF form and then print out a new completely filled out PDF. The only thing I do by hand is sign the physical paper.

replies(1): >>43991100 #

iLemming ◴[15 May 25 01:54 UTC] No.43991100[source]▶

>>43982603 #

Your understanding of immutability feels skewed here. Every time you annotate the PDF, it creates a new version. Even when you overwrite the same file, the structure of the original document changes, therefore creating a new document, ultimately making it "the ship of Theseus.pdf"

Sure, someone may try using the same argument, applying it to .doc and .txt documents, yet there is a general consensus saying that pdfs were designed to "resist the change". You can probably self-illustrate the point by making changes to a .txt document and then removing your changes - the md5 of the file would remain the same.
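The .txt round trip described above can be checked with a few lines of Python. This is only a minimal sketch: the file name `note.txt` and its contents are made up for illustration, and it rewrites the file from the script rather than using a real editor (which might, for example, change line endings on save).

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of the file's raw bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.txt"  # hypothetical file name
    original = b"PDF to text is a challenging problem.\n"

    # Write the original text and record its checksum.
    note.write_bytes(original)
    before = md5_of(note)

    # Make a change: append one line.
    note.write_bytes(original + b"A temporary annotation.\n")
    edited = md5_of(note)

    # Remove the change again, restoring the exact original bytes.
    note.write_bytes(original)
    after = md5_of(note)

print(before == after)   # True: same bytes, same checksum
print(before == edited)  # False: the edited version hashed differently
```

The checksum depends only on the bytes, so the comparison says nothing about how the bytes were produced. A PDF editor that re-serializes the file after an undo would typically not reproduce the original bytes.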

replies(1): >>44000833 #

me-vs-cat ◴[16 May 25 00:57 UTC] No.44000833{3}[source]▶

>>43991100 #

Have you ever used Acrobat? Not "Acrobat Reader", but regular Acrobat, the most popular PDF editor. It's from Adobe, and it definitely does not "resist" edits.

replies(1): >>44001063 #

iLemming ◴[16 May 25 01:39 UTC] No.44001063{4}[source]▶

>>44000833 #

I got what you're saying the first time, and you still seem to be entirely missing the point. Immutability means that an object cannot be modified after it's created, and any changes result in a new object rather than altering the original.

You're saying "well, look, I can modify this pdf and I can even undo my changes...", what I'm saying is that whenever you modify a PDF, you're essentially creating a new file rather than truly "undoing" changes in the original. PDFs have complex internal structures with metadata, object references, and possibly compression that make bit-perfect restoration challenging.

Unlike plain text files where changes can be precisely tracked and reversed at the character level, PDFs don't easily support this kind of granular reversibility. Even "undoing" in PDF editors often means generating yet another variant rather than returning to the exact binary state of the original.

Take a look at how Git stores PDFs: the delta approach doesn't work efficiently, since even small logical changes can result in significantly different binary files with completely different checksums, so it ends up storing EVERY version of the same document in a separate blob object.
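For reference, here is a rough sketch of how Git derives a blob's object ID, assuming the default SHA-1 object format. Git hashes a short header plus the file's bytes, so this applies to any file type, not only PDFs: each distinct version gets its own blob, and delta compression is applied later when objects are packed. The function name `git_blob_id` and the sample byte strings are illustrative only.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob with these bytes.

    Assumes the default SHA-1 object format: the hash covers the header
    "blob <size>\\0" followed by the raw content.
    """
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known ID.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Stand-ins for two saves of "the same" document that differ in one byte.
version_a = b"%PDF-1.7\n% illustrative bytes, revision A\n"
version_b = b"%PDF-1.7\n% illustrative bytes, revision B\n"

# Different bytes give different blob IDs, however similar the files look.
print(git_blob_id(version_a) == git_blob_id(version_b))  # False
```

Inside a repository, `git hash-object <file>` should report the same ID for a file with the same bytes.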

When you annotate a pdf and then later change your mind, undo all the annotations and save it, it may look the same as the original to your eyes, but in digital reality it will be a different file.

replies(2): >>44001184 #>>44010198 #

me-vs-cat ◴[16 May 25 22:07 UTC] No.44010198{5}[source]▶

>>44001063 #

> I got what you're saying the first time,

That wasn't me. Multiple people were taking the time to help you understand.
