I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... Google tells me it was "Tesseract", and that sounds familiar.
Data extraction is hard, but that's not what PDF is designed for; it is meant for people to read, like paper documents.
Far from being "mad", it is remarkably stable. It has some crazy features, and it is not designed for data extraction (but doesn't actively prevent it!). But look at the alternatives. Word documents? HTML? SVG? One of the zillion XML-based document formats? Markdown? Is any one of these suitable for writing, say, a scientific paper (with maths, tables, graphics...) in a way that is readable by a human on a computer or in print, will still be readable decades from now, and is easier for a machine to process than a PDF?