1303 points serjester | 14 comments
1. twelve40 ◴[] No.42956704[source]
Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data.
replies(6): >>42956861 #>>42956872 #>>42960357 #>>42961573 #>>42963880 #>>42967188 #
2. sconeguy ◴[] No.42956861[source]
What would it have taken to store the plain text in some meta field in the document? Argh, so annoying.
replies(2): >>42957625 #>>42957655 #
3. shermantanktop ◴[] No.42956872[source]
is "put this glyph at coordinate (x,y)" really what you'd call "structured"?
replies(2): >>42957640 #>>42959081 #
4. dkjaudyeqooe ◴[] No.42957625[source]
PDF provides that capability, but editors don't produce it, probably because printing goes through OS drivers that don't support it, or through PDF generators that don't support it. Or they do support it, but users don't know to check that option, or turn it off because it makes PDFs too large.
replies(1): >>42958861 #
5. dkjaudyeqooe ◴[] No.42957640[source]
He's calling PDFs unstructured: structured editors -> unstructured PDF -> structured data
6. groby_b ◴[] No.42957655[source]
PDF supports that just fine. It's just that many PDF publishers choose not to use that.

You can lead a horse to water...

7. user_7832 ◴[] No.42958861{3}[source]
Do you know what this field/type is called, and if any of the big names (MS/Adobe etc.) support creating such PDFs?
replies(2): >>42959942 #>>42960671 #
8. irjustin ◴[] No.42959081[source]
It's not the structure that allows meaningful understanding.

Something that was clearly a table now becomes a bunch of glyphs physically close to each other, versus another group of glyphs, which when considered as a group is a box visually separated from yet another group of glyphs, even though all of them are actually part of one table.
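
To make the point above concrete, here is a minimal sketch of the kind of guesswork needed to turn positioned glyphs back into table cells. Everything in it is an assumption for illustration: the `Glyph` type, the helper names, and the `y_tolerance` and `gap` thresholds are hypothetical, and real extractors have to deal with fonts, glyph widths, rotation and much more.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # horizontal origin of the glyph
    y: float  # baseline; PDF user space has y growing upward


def group_into_rows(glyphs, y_tolerance=2.0):
    """Cluster glyphs whose baselines lie within y_tolerance of a row's first glyph."""
    rows = []
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):  # top of page first
        if rows and abs(rows[-1][0].y - g.y) <= y_tolerance:
            rows[-1].append(g)
        else:
            rows.append([g])
    # Within a row, reading order is left to right.
    return [sorted(row, key=lambda g: g.x) for row in rows]


def split_into_cells(row, gap=10.0):
    """Start a new cell wherever the distance between glyph origins exceeds gap."""
    cells = [row[0].char]
    for prev, cur in zip(row, row[1:]):
        if cur.x - prev.x > gap:
            cells.append(cur.char)
        else:
            cells[-1] += cur.char
    return cells


# Glyphs arrive in arbitrary order, with slightly jittered baselines.
glyphs = [
    Glyph("c", 50, 100.5), Glyph("a", 0, 100), Glyph("b", 6, 100),
    Glyph("d", 56, 100), Glyph("1", 0, 88), Glyph("2", 50, 88),
]
table = [split_into_cells(row) for row in group_into_rows(glyphs)]
print(table)
```

Both thresholds are pure heuristics, which is the problem: nothing in the glyph stream says "this is a table", so the structure has to be inferred from geometry.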

9. bux93 ◴[] No.42959942{4}[source]
OCR software like ABBYY can spit out something called a "searchable PDF", which has a text layer underneath a picture of a scan. Otherwise, PDF has 'dictionaries' with arbitrary key-value pairs in them. The "Info" dictionary has some specific metadata fields like Author, and a "Font" dictionary embeds fonts, but you're free to use those dictionaries for whatever. There's also a standard called XMP for embedding Dublin Core, rights management and custom metadata. Files can be embedded. You can also use comments, as PDF is a subset of PostScript. When a PDF gets converted to PDF/A (by archiving software) or flattened/optimized, most of these are likely to be lost.
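
As a rough sketch of the "Info" dictionary idea, the following hand-writes a tiny one-page PDF whose Info dictionary carries a custom entry next to Author. The key name `PlainText` and the helpers `pdf_string` and `build_pdf` are made up for illustration; this is not a standard field, values are restricted to ASCII here, and viewers or optimizers may ignore or drop such entries.

```python
def pdf_string(text: str) -> bytes:
    """Encode ASCII text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return b"(" + escaped.encode("ascii") + b")"


def build_pdf(info: dict) -> bytes:
    """Build a minimal one-page PDF whose Info dictionary holds the given entries."""
    info_body = b" ".join(
        b"/" + key.encode("ascii") + b" " + pdf_string(value)
        for key, value in info.items()
    )
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
        b"<< " + info_body + b" >>",  # object 4: the Info dictionary
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, needed by the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # every xref entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R /Info 4 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# "PlainText" is a hypothetical custom key, not part of any standard.
pdf = build_pdf({"Author": "example", "PlainText": "Hello (world)"})
```

Writing `pdf` to a file should give a blank page that carries the extra entry, though nothing guarantees that downstream tools will keep it or look for it.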
10. nitwit005 ◴[] No.42960357[source]
People kind of dump whatever in pdf files, so I don't think a cleaner file format would do as much as you might think.

Digital fax services will generate pdf files, for example. They're just image data dumped into a pdf. Various scanners will also do so.

11. dkjaudyeqooe ◴[] No.42960671{4}[source]
I believe it's a "hybrid PDF" but I'm not sure if there's a further standard for merely embedding text.

https://stackoverflow.com/questions/67358370/what-the-standa...

12. surfingdino ◴[] No.42961573[source]
In my experience AWS Textract does a pretty good job without using LLMs.
13. shermantanktop ◴[] No.42963880[source]
PDFs began as just PostScript commands stored in a file. It's a genius hack, in a way, that has become a Frankenstein's monster.
14. Bluestein ◴[] No.42967188[source]
... and calls it "portable", to boot.-