PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC

90s_dev ◴[13 May 25 18:18 UTC] No.43975996[source]▶

Have any of you ever thought to yourself, this is new and interesting, and then vaguely remembered that you spent months or years becoming an expert at it earlier in life but entirely forgot it? And in fact large chunks of the very interesting things you've done just completely flew out of your mind long ago, to the point where you feel absolutely new at life, like you've accomplished relatively nothing, until something like this jars you out of that forgetfulness?

I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... Google tells me it was "tesseract" and that sounds familiar.

replies(10): >>43976023 #>>43976086 #>>43976796 #>>43977155 #>>43977580 #>>43977605 #>>43978171 #>>43979324 #>>43980118 #>>43981115 #

hallman76 ◴[14 May 25 02:32 UTC] No.43980118[source]▶

>>43975996 #

We will never get back the collective man-decades of time that has been burned by this format. When will the madness stop?

replies(3): >>43980258 #>>43980692 #>>43984513 #

theamk ◴[14 May 25 02:54 UTC] No.43980258[source]▶

>>43980118 #

When we get an alternative that can:

(1) be stored in a single file

(2) allow tables, images and anything else that can be shown on a piece of paper

(3) won't have animation, fold-out text, or anything that cannot be shown on a piece of paper

(4) won't require JavaScript or access to external sites

That means never. We got lucky that we at least got PDF before "web designers" made (3) impossible and marketers made (4) impossible.

replies(7): >>43980448 #>>43980591 #>>43981173 #>>43981225 #>>43981712 #>>43982818 #>>43984377 #

protocolture ◴[14 May 25 03:47 UTC] No.43980591[source]▶

>>43980258 #

Behold a Bitmap.

But for real, that's a pretty easy set of hurdles. Really, the barrier is the psychological fallacy that PDFs are immutable.

replies(1): >>43980708 #

theamk ◴[14 May 25 04:09 UTC] No.43980708[source]▶

>>43980591 #

Should have added "looks good on screen and on paper", "stores text compactly" and "multiple pages supported" :) And yes, that's a pretty easy set of hurdles. I wish we'd standardized on DjVu instead.

Re "PDFs are immutable": that's not a psychological fallacy, that's a primary advantage of PDFs. If I wanted a mutable format, I'd take an odt (or rtf or a doc). An "output only" format allows one to use the very latest version of an editor app while having the result work even in ancient readers, something very desirable in many contexts.

replies(4): >>43980842 #>>43982603 #>>43982758 #>>43986015 #

harshreality ◴[14 May 25 10:07 UTC] No.43982758[source]▶

>>43980708 #

What's immutable, without tools to decompress and possibly perform further de-obfuscation of text streams, is the typical way publishing software encodes text into streams inside PDFs.

It remains possible to have a PDF with text that is easily mutable with any text editor.

Even if text inside a PDF is annoyingly encoded, you can always just replace the appropriate object/text streams... if you can identify the right one(s). You can extract and edit and re-insert, or simply replace, embedded images as well.
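To illustrate the "decompress, edit, re-insert" idea, here is a minimal Python sketch. It assumes the stream uses the common FlateDecode filter (zlib) and that the text appears as a plain literal string, which is often not the case with subset fonts. The toy content stream and the helper name `edit_flate_stream` are made up for illustration. Patching a real file would also mean updating the stream's /Length entry and the cross-reference offsets, which this sketch does not attempt.

```python
import zlib


def edit_flate_stream(compressed: bytes, old: bytes, new: bytes) -> bytes:
    """Decompress a Flate-encoded stream, swap one byte string, recompress."""
    content = zlib.decompress(compressed)
    edited = content.replace(old, new)
    return zlib.compress(edited)


# Toy content stream (hypothetical): show "Hello" in font F1 at 12 pt.
original = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(original)

# Replace the literal string operand of the Tj (show text) operator.
edited = edit_flate_stream(compressed, b"(Hello)", b"(Goodbye)")
print(zlib.decompress(edited).decode("ascii"))
```

The hard part in practice is the "if you can identify the right one(s)" step: a real document may split a sentence across many operators or encode glyphs through a font-specific mapping, so a simple byte replacement like this only works on the friendliest files.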

I don't think "this format promotes, as the norm, so much obfuscation of basic text objects that it becomes impractical to edit them in situ without wholesale replacement" is the win you think it is.

"Looks good on paper" has to do with the rendering engine (largely high-DPI and good font handling/spacing/kerning), not PDF as a content layout/presentation format. A high-quality software rasterizer for PostScript or PDF, often embedded in the printer, has been the magic sauce, not the PDF file format.

Today, some large portion of end-user interaction with PDFs is via rendering into a web browser DOM via JavaScript. Text in PDFs is rendered as text in the browser. Perhaps nothing else demonstrates more clearly that the "PDF is superior" argument is invalid.

replies(1): >>43986093 #

me-vs-cat ◴[14 May 25 16:00 UTC] No.43986093[source]▶

>>43982758 #

> you can always just replace the appropriate object/text streams

Or right-click and select Edit. Works in several PDF editors, on both text and image content.
