
357 points ingve | 2 comments
1. remram No.43979397
I built a simple OSS tool for qualitative data analysis, which needs to turn uploaded documents into text (stripped HTML). PDFs have been a huge problem from day one.

I have investigated many tools, but two-column layouts, footers, etc. still frequently mangle the extracted content.

It's hard to convince my (often non-technical) users that this is a difficult problem.

replies(1): >>43982521 #
2. bartread No.43982521
Try Poppler’s pdftohtml command-line tool. For me it seems to do a good job of spitting out multi-column text in the right order. Then you have the much easier task of extracting the text from the HTML.
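A minimal sketch of driving pdftohtml from Python. The `-xml` and `-i` flags are real Poppler options (`-xml` emits an XML file with per-element coordinates, `-i` skips image extraction); the paths are placeholders:

```python
import shutil
import subprocess

def pdftohtml_cmd(pdf_path: str, out_base: str) -> list[str]:
    """Build the argv for pdftohtml.  -xml emits one XML file with
    per-element coordinates (useful for checking reading order);
    -i skips image extraction since we only want the text."""
    return ["pdftohtml", "-xml", "-i", pdf_path, out_base]

def convert(pdf_path: str, out_base: str) -> None:
    # Guard so the failure mode is obvious when poppler-utils is missing.
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found; install poppler-utils")
    subprocess.run(pdftohtml_cmd(pdf_path, out_base), check=True)
```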

Also, if it does come out in the wrong order for any pages you can analyse element coordinates to figure out which column each chunk of text belongs in.

(Note that you may have to deal with sub-columns if tables are present in any columns. I’ve never had this in my data but you may also find blocks that span across more than one column, either in whole or in part.)
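The coordinate analysis can be sketched like this, assuming pdftohtml’s `-xml` output format (`<text top=… left=…>` elements). The sample XML and the 250-unit column threshold are made up for illustration; real pages may need proper clustering of x-coordinates:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of pdftohtml -xml output for a two-column page.
SAMPLE = """<pdf2xml>
  <page number="1" height="792" width="612">
    <text top="100" left="72" width="200" height="12">Left column, line 1</text>
    <text top="100" left="320" width="200" height="12">Right column, line 1</text>
    <text top="115" left="72" width="200" height="12">Left column, line 2</text>
    <text top="115" left="320" width="200" height="12">Right column, line 2</text>
  </page>
</pdf2xml>"""

def reading_order(xml_text: str, column_split: int = 250) -> list[str]:
    """Return text chunks in reading order: the whole left column first,
    then the right column, each sorted top-to-bottom.  column_split is a
    crude x-coordinate threshold separating the two columns."""
    chunks = []
    for t in ET.fromstring(xml_text).iter("text"):
        left, top = int(t.get("left")), int(t.get("top"))
        # Sort key: (is_right_column, vertical position, horizontal position)
        chunks.append((left >= column_split, top, left, "".join(t.itertext())))
    chunks.sort()
    return [c[-1] for c in chunks]
```

This handles one page; a real pipeline would iterate over `<page>` elements and could derive the split point from the distribution of `left` values.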

They also have a pdftotext tool that may do the job for you if you leave its -layout option disabled. With -layout enabled you’ll find it keeps multi-column text side by side in the output, because it tries to closely match the physical layout of the input PDF.
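The two pdftotext modes can be sketched as a command builder; -layout is a real pdftotext flag, the paths are placeholders:

```python
def pdftotext_cmd(pdf_path: str, txt_path: str,
                  keep_layout: bool = False) -> list[str]:
    """Without -layout, pdftotext tends to emit text in reading order,
    un-interleaving columns; with -layout it mirrors the page's physical
    arrangement, so columns stay side by side in the output."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [pdf_path, txt_path]
```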

I think the pdftohtml tool is probably the way to go, because the extra metadata on each element (position, size, font) helps you decide how to treat that element, and it’s relatively straightforward to strip out the HTML tags afterwards to extract plain text.
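The final tag-stripping step is straightforward with the standard library, for example with `html.parser` (one sketch among many; an HTML library like lxml or BeautifulSoup would work just as well):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only character data, dropping every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html: str) -> str:
    s = TagStripper()
    s.feed(html)
    s.close()
    return "".join(s.parts)
```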