PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC

90s_dev ◴[13 May 25 18:18 UTC] No.43975996

Have any of you ever thought to yourself, this is new and interesting, and then vaguely remembered that you spent months or years becoming an expert at it earlier in life but entirely forgot it? And in fact large chunks of the very interesting things you've done just completely flew out of your mind long ago, to the point where you feel absolutely new at life, like you've accomplished relatively nothing, until something like this jars you out of that forgetfulness?

I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... Google tells me it was "tesseract" and that sounds familiar.

replies(10): >>43976023 #>>43976086 #>>43976796 #>>43977155 #>>43977580 #>>43977605 #>>43978171 #>>43979324 #>>43980118 #>>43981115 #

bazzargh ◴[13 May 25 20:10 UTC] No.43977155

>>43975996 #

Back in... 2006ish? I got annoyed with being unable to copy text from multicolumn scientific papers on my iRex (an early e-reader that was somewhat hackable), so I dug a bit into why that was. Under the hood, the PDF reader used poppler, so I modified poppler to infer reading order in multicolumn documents using algorithms that Tesseract's author (Thomas Breuel) had published for OCR.

It was a bit of a heuristic hack. It was 20 years ago, but as I recall, poppler's ancient API didn't really represent text runs in a way you'd want for an accessibility API. A version of the multicolumn select made it in, but it was a pain to persuade poppler's maintainer that my later suggestions to improve performance were OK, because they used slightly different heuristics and so produced different text selections in some circumstances. There was no 'right' answer, so wanting the results to match didn't make sense.

And that's how kpdf got multicolumn select, of a sort.

Using Tesseract directly for this has probably made more sense for some years now.
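
For readers wondering what "inferring reading order" in a multicolumn page can look like, here is a minimal sketch of one simple heuristic. It is not the algorithm from the comment above and not poppler's code. It assumes text boxes are already extracted with coordinates, that y grows downward, and that columns are separated by a vertical strip of whitespace at least `min_gap` wide. The `Box` type, the function names, and the `min_gap` default are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A hypothetical text box: (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def split_columns(boxes, min_gap=20.0):
    """Group boxes into columns, left to right.

    Boxes are swept in order of their left edge. A new column starts
    whenever a box begins at least min_gap to the right of everything
    seen so far in the current column.
    """
    columns = []
    right_edge = None
    for box in sorted(boxes, key=lambda b: b.x0):
        if right_edge is None or box.x0 - right_edge >= min_gap:
            columns.append([box])
            right_edge = box.x1
        else:
            columns[-1].append(box)
            right_edge = max(right_edge, box.x1)
    return columns


def reading_order(boxes, min_gap=20.0):
    """Return box texts column by column, top to bottom within each column."""
    ordered = []
    for column in split_columns(boxes, min_gap):
        ordered.extend(sorted(column, key=lambda b: (b.y0, b.x0)))
    return [b.text for b in ordered]
```

Given boxes in raw row order (left line, right line, left line, right line), `reading_order` returns the whole left column first, then the right. A full-width title or figure bridges the gap and collapses everything into one column, which is exactly the kind of case where different heuristics give different, equally defensible selections.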

replies(1): >>43978191 #

1. steeeeeve ◴[13 May 25 21:53 UTC] No.43978191

>>43977155 #

I too went down that rabbit hole. Haha. Anything around that time to get an edge in a fantasy football league. I found a bunch of historical NFL stats PDFs and it took forever to make usable data out of them.
