I have investigated many tools, but two-column layouts, footers, and the like often still scramble the extracted content.
It's hard to convince my (often non-technical) users that this is a genuinely difficult problem.
Also, if the text does come out in the wrong order for any pages, you can analyse element coordinates to figure out which column each chunk of text belongs in.
(Note that you may have to deal with sub-columns if tables are present in any columns. I’ve never had this in my data, but you may also find blocks that span more than one column, either in whole or in part.)
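The coordinate analysis can be sketched roughly like this. Assume each text chunk carries its left/top position (with pdftohtml's XML output you would read these from each element's attributes; the chunk format and the `column_split` threshold here are illustrative assumptions, not anything a real tool hands you directly):

```python
# Sketch: assign text chunks to a column by x-coordinate, then restore
# reading order by sorting left column before right, top-to-bottom
# within each column. Chunk format (left, top, text) is hypothetical.

def reading_order(chunks, column_split):
    """Return chunk texts in reading order for a simple two-column page."""
    def key(chunk):
        left, top, _ = chunk
        column = 0 if left < column_split else 1  # 0 = left, 1 = right
        return (column, top)
    return [text for _, _, text in sorted(chunks, key=key)]

chunks = [
    (320, 100, "Right column, first line."),
    (40, 100, "Left column, first line."),
    (40, 120, "Left column, second line."),
    (320, 120, "Right column, second line."),
]
print(" ".join(reading_order(chunks, column_split=300)))
```

A real page would need the split point derived from the data (e.g. a histogram of left edges) rather than hard-coded, and the sub-column/spanning-block cases mentioned above would need extra handling.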
They also have a pdftotext tool that may do the job for you if you leave its -layout option disabled. With -layout enabled it tries to closely match the visual layout of the input PDF, so you’ll find it generates multi-column text in the output.
I think the pdftohtml tool is probably the way to go: the extra metadata on each element is likely to help you decide how to treat that element, and it’s relatively straightforward to strip out the HTML tags afterwards to extract plain text.
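Stripping the tags needs nothing beyond the standard library. A minimal sketch using html.parser (the sample markup is illustrative, loosely shaped like pdftohtml output, not copied from a real conversion):

```python
# Sketch: collect only the text nodes from HTML, discarding all tags
# and attributes, using the standard-library HTML parser.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

sample = '<p style="top:100px;left:40px"><b>Heading</b> body text</p>'
extractor = TextExtractor()
extractor.feed(sample)
print(extractor.text())  # Heading body text
```

In practice you'd feed it each page of pdftohtml's output and perhaps keep the position attributes around before discarding the tags, so the column analysis above still has coordinates to work with.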