PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 2 comments | 13 May 25 15:01 UTC


dwheeler ◴[13 May 25 16:22 UTC] No.43974621[source]▶

The better solution is to embed, in the PDF, the editable source document. This is easily done by LibreOffice. Embedding it takes very little space in general (because it compresses well), and then you have MUCH better information on what the text is and its meaning. It works just fine with existing PDF readers.

replies(5): >>43974667 #>>43974983 #>>43975217 #>>43975401 #>>43976216 #

lelandfe ◴[13 May 25 17:25 UTC] No.43975401[source]▶

>>43974621 #

The better solution to a search engine extracting text from existing PDFs is to provide advice on how to author PDFs?

What's the timeline for this solution to pay off?

replies(1): >>43976378 #

chaps ◴[13 May 25 18:58 UTC] No.43976378[source]▶

>>43975401 #

Microsoft is one of the bigger contributors to this. Like -- why does Excel have a feature to export to PDF, but not a feature to do the opposite? That export functionality really feels like it was given to a summer intern who finished it in two weeks and never had to deal with it ever again.

replies(2): >>43978047 #>>43980433 #

1. bartread ◴[14 May 25 03:24 UTC] No.43980433[source]▶

>>43976378 #

It does have a feature to do the opposite. You can, in theory, extract tabular data from PDFs with Excel (note: only on the Windows version; this function isn’t available in macOS Excel).

In practice I’ve found it to be extremely unreliable, and I suspect this may be because the optional metadata that semantically defines a table as a table is missing from the errant PDF. It’ll still look like a table when rendered, but there’s nothing that defines it as such. It’s just a bunch of graphical and text elements that, when rendered, happen to look like a table.
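To illustrate what "a bunch of text elements that happen to look like a table" means for an extractor, here is a minimal sketch of the kind of guesswork involved. It assumes text fragments have already been pulled out of the page with their coordinates; the `Fragment` type, the `group_into_rows` helper, and the `y_tolerance` value are all hypothetical, and this is not a description of how Excel does it.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A positioned piece of text on a page (hypothetical type for this sketch)."""
    x: float  # horizontal position of the fragment
    y: float  # vertical position; PDF coordinates grow upward
    text: str


def group_into_rows(fragments, y_tolerance=2.0):
    """Guess table rows by clustering fragments with nearly equal y values."""
    rows = []
    # Visit fragments top to bottom (descending y), then left to right.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        # Same row if the baseline is close to the first fragment of the current row.
        if rows and abs(rows[-1][0].y - frag.y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Order the cells within each row by horizontal position.
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


page = [
    Fragment(200, 700.0, "Qty"),
    Fragment(72, 700.5, "Item"),
    Fragment(72, 680.0, "Widget"),
    Fragment(200, 679.6, "3"),
]
print(group_into_rows(page))
```

Nothing in the input says which fragments belong to the same cell, row, or column, so the result depends entirely on the tolerance chosen. Wrapped cell text, merged cells, or slightly misaligned baselines will break a heuristic like this, which fits the unreliability described above.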

replies(1): >>43997734 #

2. chaps ◴[15 May 25 18:16 UTC] No.43997734[source]▶

>>43980433 (TP) #

Yeah. The "extremely unreliable" part of that is the stinker. Some of the exports I get through FOIA are thousands and thousands of pages, so the unreliability compounds really quickly. It's frustrating, because there are many things Microsoft could do with PDFs to make that a non-problem. But it's consistently been a naive implementation that doesn't consider newlines.
