
357 points ingve | 2 comments
1. ted_dunning ◴[] No.43975471[source]
One of my favorite documents for highlighting the challenges described here is the PDF for this article:

https://academic.oup.com/auk/article/126/4/717/5148354

The first page is a classic, with two columns of text, centered headings, and a text inclusion that sits between the columns and changes the line lengths and indentations for the columns. Then we get the fun of page headers that change between odd and even pages and section header conventions that vary drastically.

Oh... to make things even better, paragraphs don't get extra spacing and don't always have an indented first line.

Some of everything.

replies(1): >>43975598 #
2. JKCalhoun ◴[] No.43975598[source]
The API in CoreGraphics (macOS) for PDF, at a basic level, simply presented the text, per page, in the order in which it was encoded in the dictionaries. And 95% of the time this was pretty good, and when working with PDFKit and Preview on the Mac, we got by with it for years.

If you stepped back you could imagine the app that originally captured/produced the PDF (perhaps a word processor): it was likely rendering the text into the PDF context in some reasonable order from its own text buffer(s). So even for two columns, you would expect, and often found, that the text flowed correctly from the left column to the right. The text was therefore already in the correct order within the PDF document.
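
To make the point concrete, here is a minimal Python sketch (not the CoreGraphics or PDFKit API). The `Fragment` type, the coordinates, and both function names are hypothetical, and it assumes the producer emitted the left column before the right. Under that assumption the encoded order is already the reading order, while a naive top-to-bottom sort interleaves the columns:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical text run: position in PDF points plus its text."""
    x: float  # left edge
    y: float  # baseline; PDF origin is bottom-left, so larger y is higher on the page
    text: str


def stream_order(fragments):
    """Return the text exactly as encoded, with no sorting."""
    return [f.text for f in fragments]


def naive_geometric_order(fragments):
    """Sort top-to-bottom, then left-to-right, ignoring column boundaries."""
    return [f.text for f in sorted(fragments, key=lambda f: (-f.y, f.x))]


# A two-column page as a word processor might plausibly emit it:
# the whole left column first, then the whole right column.
page = [
    Fragment(72, 700, "L1"),
    Fragment(72, 686, "L2"),
    Fragment(320, 700, "R1"),
    Fragment(320, 686, "R2"),
]

print(stream_order(page))           # ['L1', 'L2', 'R1', 'R2']
print(naive_geometric_order(page))  # ['L1', 'R1', 'L2', 'R2']
```

So "do nothing clever" can beat a simple geometric sort, as long as the producing app wrote the text out in a sensible order.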

Now, footers and headers on the page: it would be anyone's guess what order the PDF-producing app dumped those into the PDF context.
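
Since position in the stream tells you little about headers and footers, one common workaround is to look for lines that repeat across pages. The following is only a rough sketch of that heuristic, not anything PDFKit did: the function names, the digit normalization, and the 0.4 threshold are all assumptions, with the threshold set below 0.5 so that headers alternating between odd and even pages still count as repeating.

```python
import re
from collections import Counter


def normalize(line):
    """Collapse digit runs so 'Title 717' and 'Title 719' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def find_running_lines(pages, min_fraction=0.4):
    """Return normalized lines that occur on at least min_fraction of the pages."""
    counts = Counter()
    for lines in pages:
        # Use a set so a line counts at most once per page.
        counts.update({normalize(line) for line in lines})
    threshold = min_fraction * len(pages)
    return {key for key, n in counts.items() if n >= threshold}


def strip_running_lines(pages, min_fraction=0.4):
    """Drop likely headers/footers from each page, keeping the other lines in order."""
    running = find_running_lines(pages, min_fraction)
    return [
        [line for line in lines if normalize(line) not in running]
        for lines in pages
    ]


# Made-up pages with headers that alternate between odd and even pages.
pages = [
    ["Journal Title 717", "body a"],
    ["718 Author et al.", "body b"],
    ["Journal Title 719", "body c"],
    ["720 Author et al.", "body d"],
]

print(strip_running_lines(pages))  # [['body a'], ['body b'], ['body c'], ['body d']]
```

It will misfire on short documents and on body text that genuinely repeats, so treat it as a starting point.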