PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 4 comments | 13 May 25 15:01 UTC

90s_dev ◴[13 May 25 18:18 UTC] No.43975996[source]▶

Have any of you ever thought to yourself, this is new and interesting, and then vaguely remembered that you spent months or years becoming an expert at it earlier in life but entirely forgot it? And in fact large chunks of the very interesting things you've done just completely flew out of your mind long ago, to the point where you feel absolutely new at life, like you've accomplished relatively nothing, until something like this jars you out of that forgetfulness?

I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... Google tells me it was "tesseract" and that sounds familiar.

replies(10): >>43976023 #>>43976086 #>>43976796 #>>43977155 #>>43977580 #>>43977605 #>>43978171 #>>43979324 #>>43980118 #>>43981115 #

hallman76 ◴[14 May 25 02:32 UTC] No.43980118[source]▶

>>43975996 #

We will never get back the collective man-decades of time that has been burned by this format. When will the madness stop?

replies(3): >>43980258 #>>43980692 #>>43984513 #

theamk ◴[14 May 25 02:54 UTC] No.43980258[source]▶

>>43980118 #

When we get an alternative that can:

(1) be stored in a single file

(2) allow tables, images and anything else that can be shown on a piece of paper

(3) won't have animation, fold-out text, or anything that cannot be shown on a piece of paper

(4) won't require JavaScript or access to external sites

That means never. We got lucky that we at least got PDF before "web designers" made (3) impossible and marketers made (4) impossible.

replies(7): >>43980448 #>>43980591 #>>43981173 #>>43981225 #>>43981712 #>>43982818 #>>43984377 #

1. dqv ◴[14 May 25 05:39 UTC] No.43981173[source]▶

>>43980258 #

> (3) won't have animation, fold-out text, or anything that cannot be shown on a piece of paper

> (4) won't require JavaScript or access to external sites

So about that... https://opensource.adobe.com/dc-acrobat-sdk-docs/library/jsa...

replies(3): >>43982954 #>>43984064 #>>43998265 #

2. lmz ◴[14 May 25 10:47 UTC] No.43982954[source]▶

>>43981173 (TP) #

Also: https://pdfa.org/3d-pdf-showcase/

3. izacus ◴[14 May 25 13:05 UTC] No.43984064[source]▶

>>43981173 (TP) #

Did you miss the meaning of the word "require"?

4. theamk ◴[15 May 25 19:08 UTC] No.43998265[source]▶

>>43981173 (TP) #

That's the power of legacy. Adobe may think they can add junk to PDF like JavaScript support, or lmz's "3D PDF" link below, but since PDF viewers form a diverse ecosystem, those features won't see wide adoption.

And this is actually pretty great, maybe even the best part of PDFs! Companies _know_ that publishing PDFs that require 3D graphics or JavaScript means many people won't be able to view them, so they publish good, static PDFs, maintaining a virtuous cycle.
