I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... google tells me it was "tesseract" and that sounds familiar.
I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... google tells me it was "tesseract" and that sounds familiar.
(1) be stored in a single file
(2) Allow tables, images and anything else that can be shown on a piece paper
(3) Won't have animation, fold-out text, or anything that cannot be be shown on a piece of paper
(4) won't require Javascript or access to external sites
that means never.. We've got lucky we at least got PDF before "web designers" made (3) impossible, and marketers made (4) impossible
But for real, thats a pretty easy set of hurdles. Really the barrier is the psychological fallacy that PDF's are immutable.
Re "PDF's are immutable." - that's not a psychological fallacy, that's a primary advantage of PDFs. If I wanted mutable format, I'd take an odt (or rtf or a doc). "Output only" format allows one to use the very latest version of editor app, while having the result working even in ancient readers, something very desirable in many contexts.
If you want alternatives, I'd choose DjVu. But it's too late now, everyone is converged on PDFs, and the alternatives are not good enough to warrant the switch.
> (4) won't require Javascript or access to external sites
So about that... https://opensource.adobe.com/dc-acrobat-sdk-docs/library/jsa...
(0) that reproduce everywhere on any OS perfectly
(0.5) that supports (everything) any typographical engineers ever wanted past and future
Bitmap formats are out from clause -1, Office file formats disqualify from clause 0, Markdown doesn't satisfy clause 0.5. Otherwise a Word .doc format covers most of clauses 1-4.
Can somebody explain why this isn't the case for HTML? I'm frequently in a situation where a website that mimics printed pages fails to render the same between Firefox and Chrome. I wish to understand the primary culprit here. I thought all of the CSS units are completely defined?
It remains possible to have a pdf with text that is easily mutable with any text editor.
Even if text inside a pdf is annoyingly encoded, you can always just replace the appropriate object/text streams... if you can identify the right one(s). You can extract and edit and re-insert, or simply replace, embedded images as well.
I don't think "this format promotes, as the norm, so much obfuscation of basic text objects that it becomes impractical to edit them in situ without wholesale replacement" is the win you think it is.
"Looks good on paper" has to do with the rendering engine (largely high-DPI and good font handling/spacing/kerning), not PDF as a content layout/presentation format. A high-quality software rasterizer (for postscript or PDF, often embedded in the printer)—not the PDF file format—has been the magic sauce.
Today, some large portion of end-user interaction with PDFs is via rendering into a web browser DOM via javascript. Text in PDFs is rendered as text in the browser. Perhaps nothing else demonstrates more clearly that the "PDF is superior" argument is invalid.
You also can't really embed fonts in a HTML file, you rely on linking instead -- and those can rot. Apparently there has been some work towards it (base64 encoded), but support may vary. And you need to embed the whole font, I don't think you can do character subsets easily.
Sure, someone may try using the same argument, applying it to .doc and .txt documents, yet there is a general consensus saying that pdfs were designed to "resist the change". You can probably self-illustrate the point by making changes to a .txt document and then removing your changes - the md5 of the file would remain the same.
And this is actually pretty great, maybe even the best part of PDFs! Companies _know_ that publishing PDF that require 3d-graphics or Javascript means many people won't be able to see them, so they publish good, static PDFs, maintaining virtuous cycle.
You're saying "well, look, I can modify this pdf and I can even undo my changes...", what I'm saying is that whenever you modify a PDF, you're essentially creating a new file rather than truly "undoing" changes in the original. PDFs have complex internal structures with metadata, object references, and possibly compression that make bit-perfect restoration challenging.
Unlike plain text files where changes can be precisely tracked and reversed at the character level, PDFs don't easily support this kind of granular reversibility. Even "undoing" in PDF editors often means generating yet another variant rather than returning to the exact binary state of the original.
Take a look at how Git stores PDFs - when the delta approach doesn't work efficiently since even small logical changes can result in significantly different binary files with completely different checksums, it stores EVERY version of the same document in a separate blob object.
When you annotate a pdf and then later change your mind, undo all the annotations and save it — only to your eyes it may look the same as the original — in digital reality, it will be a different file.
Also, that isnt even an intention of the file format as far as I can see, its mostly a byproduct of cruft and backwards compatibility.
No one would call .doc immutable because its very difficult to move an image and then restore that image to the original location.
In this context, people will save something out as pdf and store it because they dont think it cannot be modified.
But as has been rightly pointed out, thats not the case.
Immutability doesn't mean that an "object cannot be modified", it means that in order to modify an object, you must create a new (clone) object. That's all what I meant to say. Sure, you can get pedantic or otherwise and say "yes, pdfs are immutable; or no, pdfs aren't immutable in some contexts", etc., and depending on the point of view, both of these claims could be correct — I'm not arguing about the specifics.
I'm just saying that your explanation of why you think pdfs are not immutable hinges on an incorrect idea of what immutability actually is.
There's a rigorous definition for "immutability" in computer science, e.g., strings in many programming languages are immutable, but that doesn't mean you can't manipulate them, it just means that operations that appear to modify strings actually create new string objects.
The greatest illustration for immutability is imbued in programming languages with immutability-by-default, e.g., Clojure. Once someone groks the basics, it becomes really clear what that thing is about.