PDF to Text, a challenging problem

This was a great read. You've done an excellent job breaking down what makes PDFs so uniquely annoying to work with. People often underestimate how much of the “document-ness” (headings, paragraphs, tables) is purely visual, with no underlying semantic structure in the file.

We ran into many of the same challenges while working on Docsumo, where we process business documents like invoices, bank statements, and scanned PDFs. In real-world use cases, things get even messier: inconsistent templates, rotated scans, overlapping text, or documents generated by ancient software with no tagging at all.

One thing we’ve found helpful (in addition to heuristics like font size/weight and spacing) is combining layout parsing with ML models trained to infer semantic roles (like "header", "table cell", "footer", etc.). It’s far from perfect, but it helps bridge the gap between how the document looks and what it means.
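For anyone curious what the heuristic side can look like, here is a minimal sketch. It is not our production code. It assumes a layout parser has already produced text spans with a font size, a bold flag, and a vertical position. The `Span` type, the `guess_roles` function, and all thresholds are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text span, as a layout parser might emit it."""
    text: str
    font_size: float
    bold: bool
    y: float  # distance from the top of the page, in points


def guess_roles(spans, page_height=792.0, footer_margin=50.0, heading_ratio=1.2):
    """Assign a rough semantic role to each span using font and position cues.

    All thresholds are illustrative assumptions, not tuned values.
    """
    if not spans:
        return []

    # Treat the font size that covers the most characters as the body size.
    chars_per_size = Counter()
    for span in spans:
        chars_per_size[round(span.font_size, 1)] += len(span.text)
    body_size = chars_per_size.most_common(1)[0][0]

    roles = []
    for span in spans:
        if span.y >= page_height - footer_margin:
            role = "footer"  # sits in the bottom margin of the page
        elif span.font_size >= body_size * heading_ratio or (
            span.bold and len(span.text) < 60
        ):
            role = "header"  # clearly larger than body text, or short and bold
        else:
            role = "paragraph"
        roles.append((role, span.text))
    return roles


spans = [
    Span("Invoice 2024-001", 18.0, True, 40.0),
    Span("Payment is due within 30 days of the invoice date shown above.", 10.0, False, 120.0),
    Span("Page 1 of 3", 8.0, False, 770.0),
]
roles = guess_roles(spans)
for role, text in roles:
    print(f"{role:10} {text}")
```

Rules like these break quickly on rotated scans or inconsistent templates, which is where a trained model on top of the layout features earns its keep.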

Really appreciate posts like this. PDF wrangling is a dark art more people should talk about.