I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... google tells me it was "tesseract" and that sounds familiar.
(1) be stored in a single file
(2) allow tables, images, and anything else that can be shown on a piece of paper
(3) won't have animation, fold-out text, or anything that cannot be shown on a piece of paper
(4) won't require Javascript or access to external sites
That means never. We got lucky that we at least got PDF before "web designers" made (3) impossible and marketers made (4) impossible.
But for real, that's a pretty easy set of hurdles. Really, the barrier is the psychological fallacy that PDFs are immutable.
Re "PDFs are immutable": that's not a psychological fallacy, that's a primary advantage of PDFs. If I wanted a mutable format, I'd take an ODT (or an RTF or a DOC). An "output only" format lets one use the very latest version of an editor app while the result still works even in ancient readers, which is very desirable in many contexts.
It remains possible to have a PDF with text that is easily mutable with any text editor.
Even if the text inside a PDF is annoyingly encoded, you can always just replace the appropriate object/text streams... if you can identify the right one(s). You can extract, edit, and re-insert (or simply replace) embedded images as well.
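For anyone who wants to poke at this, here is a rough Python sketch of the "find the stream, edit it, put it back" idea. It assumes the streams are either raw or use only the /FlateDecode filter, and it finds them with a naive regex instead of honoring /Length, so treat it as an illustration and not a real PDF parser. The names `iter_streams` and `replace_in_stream` are made up for this example.

```python
import re
import zlib

# Naive pattern: an object's dictionary followed directly by its stream.
# The (?!endobj) guard keeps the dictionary match inside a single object.
STREAM_RE = re.compile(
    rb"obj\s*<<((?:(?!endobj).)*?)>>\s*stream\r?\n(.*?)endstream",
    re.DOTALL,
)


def iter_streams(pdf_bytes):
    """Yield (dictionary_bytes, decoded_payload) for each stream found."""
    for match in STREAM_RE.finditer(pdf_bytes):
        dictionary, payload = match.group(1), match.group(2)
        if b"/FlateDecode" in dictionary:
            try:
                # decompressobj tolerates the end-of-line bytes that sit
                # between the compressed data and the endstream keyword.
                payload = zlib.decompressobj().decompress(payload)
            except zlib.error:
                continue  # other filters or predictors: skip in this sketch
        else:
            payload = payload.rstrip(b"\r\n")
        yield dictionary, payload


def replace_in_stream(decoded_payload, old, new):
    """Swap old for new, then return (recompressed_bytes, new_length)."""
    packed = zlib.compress(decoded_payload.replace(old, new))
    return packed, len(packed)


if __name__ == "__main__":
    # Hypothetical fragment with one compressed content stream.
    content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
    packed = zlib.compress(content)
    sample = (
        b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
        b"4 0 obj\n<< /Length " + str(len(packed)).encode()
        + b" /Filter /FlateDecode >>\nstream\n" + packed
        + b"\nendstream\nendobj\n"
    )
    for dictionary, payload in iter_streams(sample):
        print(dictionary.strip(), payload)
```

The catch is everything around the stream: after re-inserting the new bytes, the object's /Length has to match the new size, and the byte offsets in the cross-reference table shift, so the file needs to be rewritten or repaired by a proper PDF tool. And if the font is subsetted or uses a custom encoding, the literal string you are searching for may not appear in the decoded stream at all.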
I don't think "this format promotes, as the norm, so much obfuscation of basic text objects that it becomes impractical to edit them in situ without wholesale replacement" is the win you think it is.
"Looks good on paper" has to do with the rendering engine (largely high-DPI and good font handling/spacing/kerning), not PDF as a content layout/presentation format. A high-quality software rasterizer (for postscript or PDF, often embedded in the printer)—not the PDF file format—has been the magic sauce.
Today, a large portion of end-user interaction with PDFs happens through rendering into a web browser DOM via JavaScript. Text in PDFs is rendered as text in the browser. Perhaps nothing else demonstrates more clearly that the "PDF is superior" argument is invalid.