357 points by ingve | 2 comments | source
bartread ◴[] No.43974140[source]
Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where a table is simply a loose assemblage of graphical and text elements — only when rendered, with everything positioned just so, does it become discernible as a table.

I've actually had decent luck extracting tabular data from PDFs by converting them to HTML using the Poppler PDF utils, finding the expected table header, and then using the x-coordinate of the HTML elements for each value within the table to work out columns and extract values for each row.

It's kind of grotty, but it seems reliable for what I need. Certainly much more so than going via formatted plaintext, which has issues with inconsistent spacing and newlines inserted into the middle of rows.
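A minimal sketch of the x-coordinate idea described above, assuming XML output in the style of Poppler's `pdftohtml -xml` (one `<text>` element per positioned run, with `top`/`left` attributes). The sample document, tolerances, and function name are all hypothetical, and a real extractor would need to handle fonts, page breaks, and multi-line cells:

```python
# Hedged sketch: cluster pdftohtml-style <text> elements into a table by
# grouping on 'left' (columns) and 'top' (rows). Sample input is made up.
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<pdf2xml>
  <page number="1">
    <text top="100" left="50">Name</text>
    <text top="100" left="200">Qty</text>
    <text top="100" left="300">Price</text>
    <text top="120" left="50">Widget</text>
    <text top="120" left="200">3</text>
    <text top="120" left="300">9.99</text>
    <text top="140" left="50">Gadget</text>
    <text top="140" left="200">7</text>
    <text top="140" left="300">4.50</text>
  </page>
</pdf2xml>"""

def extract_table(xml_text, col_tol=10, row_tol=5):
    """Group <text> elements into rows by 'top' and columns by 'left'."""
    elems = [(int(t.get("top")), int(t.get("left")), t.text)
             for t in ET.fromstring(xml_text).iter("text")]
    # Collapse nearby 'left' coordinates into a list of column positions.
    cols = []
    for left in sorted({e[1] for e in elems}):
        if not cols or left - cols[-1] > col_tol:
            cols.append(left)
    # Bucket elements into rows by quantised 'top', then assign each to
    # the nearest column position.
    rows = defaultdict(lambda: [None] * len(cols))
    for top, left, text in elems:
        row_key = round(top / row_tol) * row_tol
        col_idx = min(range(len(cols)), key=lambda i: abs(cols[i] - left))
        rows[row_key][col_idx] = text
    return [rows[k] for k in sorted(rows)]

table = extract_table(SAMPLE)
```

The tolerance parameters are the fiddly part in practice: too tight and slightly misaligned cells split into spurious columns, too loose and adjacent columns merge.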

replies(4): >>43974220 #>>43976252 #>>43976596 #>>43984384 #
hermitcrab ◴[] No.43976596[source]
I am hoping at some point to be able to extract tabular data from PDFs for my data wrangling software. If anyone knows of a library that can extract tables from PDFs, can be integrated into a C++ app, and is free or less than a few hundred $, please let me know!
replies(1): >>43979022 #
1. ______ ◴[] No.43979022[source]
pdfplumber is great for table extraction, but it's Python.
replies(1): >>43984345 #
2. hermitcrab ◴[] No.43984345[source]
Thanks, but I prefer to keep everything C++ for simplicity and speed.