PDF to Text, a challenging problem

"PDF to Text" is a bit simplified IMO. There are actually a few classes of problems within this category:

1. reliable OCR from documents (to index for search, feed into a vector DB, etc)

2. structured data extraction (pull out targeted values)

3. end-to-end document pipelines (e.g. automate mortgage applications)

Marginalia needs to solve problem #1 (OCR), which is luckily getting commoditized by the day thanks to models like Gemini Flash. I've now seen multiple companies replace their OCR pipelines with Flash for a fraction of the cost of previous solutions. It's really quite remarkable.

Problems #2 and #3 are much trickier. There's still a large gap for businesses in going from raw OCR outputs to document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation eventually, but it takes sustained effort. The future is definitely moving in this direction though.
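To make the "detect uncertainty and correct with human-in-the-loop" step concrete, here's a rough sketch of the routing logic. Everything in it is hypothetical: the `Field` type, the `route` function, and the 0.9 threshold are illustrative names and values, and it assumes your extractor reports some per-field confidence score, which not every model does out of the box.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # assumed 0.0-1.0 score from a hypothetical extractor


def route(fields, threshold=0.9):
    """Split extracted fields into auto-accepted and needs-human-review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    return accepted, review


fields = [
    Field("borrower_name", "Jane Doe", 0.98),
    Field("loan_amount", "250000", 0.62),
]
accepted, review = route(fields)
print([f.name for f in accepted])  # fields that pass straight through
print([f.name for f in review])    # fields queued for a human to check
```

In practice the threshold would be tuned per field against a labeled dataset, which is part of why the labeling work doesn't go away.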

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.ai).

There's also #4, reliable OCR and semantics extraction that works across many diverse classes of documents, which is relevant for accessibility.

This is hard because:

1. Unlike a business workflow which often only deals with a few specific kinds of documents, you never know what the user is going to get. You're making an abstract PDF reader, not an app that can process court documents in bankruptcy cases in Delaware.

2. You don't just need the text (like in traditional OCR), you need to recognize tables, page headers and footers, footnotes, headings, mathematics etc.

3. Because this is for human consumption, you want to minimize errors as much as possible, which means not using OCR when not needed, and relying on the underlying text embedded within the PDF while still extracting semantics. This means you essentially need two different paths: one for when the PDF consists only of images, and one for when there are content streams you can get some information from.

3.1. But the content streams may contain different text from what's actually on the page, e.g. white-on-white text to hide information the user isn't supposed to see, or diacritics emulation with commands that manually draw acute accents instead of using proper Unicode diacritics (LaTeX works that way).

4. You're likely running as a local app on the user's (possibly very underpowered) device, and likely don't have an associated server and subscription, so you can't use any cloud AI models.

5. You need to support forms. Since the user is using accessibility software, presumably they can't print and use a pen, so you need to handle the ones meant for printing too, not just the nice, spec-compatible ones.
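As a rough illustration of points 3 and 3.1, here's a minimal sketch of the two pieces involved: picking a path per page, and folding a separately drawn accent back into its base letter. The function names, the 20-character cutoff, and the accent table are all assumptions made up for this example. It also assumes some PDF library has already handed you the page's embedded text and already paired each accent glyph with the letter it overlaps, which is the genuinely hard part and isn't shown.

```python
import unicodedata

# Spacing accent glyphs mapped to their Unicode combining equivalents.
# This table is illustrative, not exhaustive.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
    "^": "\u0302",       # circumflex
    "~": "\u0303",       # tilde
    "\u00a8": "\u0308",  # diaeresis
}


def choose_path(embedded_text, min_chars=20):
    """Pick "content_stream" if the page has enough embedded text, else "ocr"."""
    visible = sum(
        1 for ch in embedded_text if ch.isprintable() and not ch.isspace()
    )
    return "content_stream" if visible >= min_chars else "ocr"


def compose_accent(base, accent):
    """Merge a base letter and a separately drawn accent into one character."""
    combining = SPACING_TO_COMBINING.get(accent)
    if combining is None:
        return base  # unknown accent glyph: leave the base letter alone
    return unicodedata.normalize("NFC", base + combining)


print(choose_path(""))                    # image-only page
print(compose_accent("e", "\u00b4"))      # e plus a drawn acute accent
```

A character count alone won't catch the white-on-white case; for that you'd also need to compare the text's fill color against the background, or cross-check the content stream against OCR of the rendered page.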

This is very much an open problem and is not even remotely close to being solved. People have been taking stabs at it for years, but all current solutions suck in some way, and there's no single one that solves all 5 points correctly.