1303 points serjester | 3 comments
twelve40 ◴[] No.42956704[source]
Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data?
replies(6): >>42956861 #>>42956872 #>>42960357 #>>42961573 #>>42963880 #>>42967188 #
sconeguy ◴[] No.42956861[source]
What would it have taken to store the plain text in some meta field in the document? Argh, so annoying.
replies(2): >>42957625 #>>42957655 #
dkjaudyeqooe ◴[] No.42957625[source]
PDF provides that capability, but editors don't produce it, probably because printing goes through OS drivers that don't support it, or through PDF generators that don't support it. Or they do support it, but users don't know to check that option, or turn it off because it makes PDFs too large.
replies(1): >>42958861 #
1. user_7832 ◴[] No.42958861[source]
Do you know what this field/type is called, and if any of the big names (MS/Adobe etc) support creating such PDFs?
replies(2): >>42959942 #>>42960671 #
2. bux93 ◴[] No.42959942[source]
OCR software like ABBYY can spit out something called a "searchable PDF", which has a text layer underneath a picture of a scan. Otherwise, PDF has 'dictionaries' with arbitrary key-value pairs in them. The "Info" dictionary has some specific metadata fields like Author, and a "Font" dictionary embeds fonts, but you're free to use those dictionaries for whatever. There's also a standard called XMP for embedding Dublin Core, rights management and custom metadata. Files can be embedded. You can also use comments, as PDF borrows its `%` comment syntax from PostScript. When a PDF gets converted to PDF/A (by archiving software) or flattened/optimized, most of these are likely to be lost.
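To make the "arbitrary key-value pairs" point concrete, here is a rough stdlib-only Python sketch that hand-writes a one-page PDF whose Info dictionary carries the plain text under a custom key, then reads it back. The key name `/PlainText` and the helper names are made up for illustration; nothing standardizes them, and no mainstream editor is known to emit such a key. It also assumes single-line Latin-1 text and an uncompressed file, so treat it as a toy, not a parser.

```python
import re


def pdf_string(text: str) -> bytes:
    """Encode text as a PDF literal string, escaping backslash and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Assumption: text fits in Latin-1; anything else raises UnicodeEncodeError.
    return b"(" + escaped.encode("latin-1") + b")"


def build_pdf(plain_text: str) -> bytes:
    """Build a minimal one-page PDF with a hypothetical /PlainText Info entry."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
        # Object 4 is the Info dictionary; /PlainText is our invented key.
        b"<< /Title (Demo) /PlainText " + pdf_string(plain_text) + b" >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each xref entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R /Info 4 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_plain_text(pdf: bytes) -> str:
    """Pull the /PlainText literal string back out with a regex (toy approach)."""
    match = re.search(rb"/PlainText \(((?:\\.|[^\\()])*)\)", pdf)
    if match is None:
        raise ValueError("no /PlainText entry found")
    raw = match.group(1).decode("latin-1")
    return re.sub(r"\\(.)", r"\1", raw)  # undo the backslash escapes


if __name__ == "__main__":
    pdf = build_pdf("Hello (structured) world")
    print(read_plain_text(pdf))
```

As noted above, a key like this is exactly the kind of thing that a PDF/A conversion or an optimizer pass is likely to throw away, so it only helps if the file reaches the reader untouched.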
3. dkjaudyeqooe ◴[] No.42960671[source]
I believe it's a "hybrid PDF" but I'm not sure if there's a further standard for merely embedding text.

https://stackoverflow.com/questions/67358370/what-the-standa...