357 points ingve | 9 comments
1. kbyatnal ◴[] No.43975807[source]
"PDF to Text" is a bit simplified IMO. There's actually a few class of problems within this category:

1. reliable OCR from documents (to index for search, feed into a vector DB, etc)

2. structured data extraction (pull out targeted values)

3. end-to-end document pipelines (e.g. automate mortgage applications)

Marginalia needs to solve problem #1 (OCR), which is luckily getting commoditized by the day thanks to models like Gemini Flash. I've now seen multiple companies replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.

Problems #2 and #3 are much trickier. There's still a large gap for businesses in going from raw OCR outputs -> document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. The future is definitely moving in this direction though.
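The classify -> split -> extract loop with uncertainty gating can be sketched roughly like this (all names here are hypothetical placeholders I made up for illustration, not any real product's API):

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.9  # tune per field and per use case

@dataclass
class Extraction:
    field_name: str
    value: str
    confidence: float  # model-reported or calibrated score

def process(pages, classify, split, extract):
    """Hypothetical pipeline: classify the document, split it into
    logical chunks, extract fields, and route anything the model is
    unsure about to a human-in-the-loop review queue."""
    doc_type = classify(pages)              # e.g. "w2", "bank_statement"
    auto, review = [], []
    for chunk in split(pages, doc_type):    # one logical document per chunk
        for ex in extract(chunk, doc_type):
            target = auto if ex.confidence >= CONFIDENCE_THRESHOLD else review
            target.append(ex)
    return auto, review  # `review` goes to humans; corrections feed fine-tuning
```

The point of returning the low-confidence bucket separately is that the human corrections become labeled data you can fine-tune on, which is how the automation rate creeps up over time.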

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.ai)

replies(3): >>43976203 #>>43976790 #>>43977158 #
2. varunneal ◴[] No.43976203[source]
I've been hacking away at trying to process PDFs into Markdown, having encountered similar obstacles to OP regarding header detection (and many other issues). OCR is fantastic these days, but maintaining the global structure of the document is much trickier. Consistent HTML still seems out of reach for large documents. I'm having half-decent results with Markdown using multiple passes of an LLM to extract document structure and feeding it in contextually for page-by-page extraction.
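The multi-pass idea can be sketched like this (`llm` is a placeholder for whatever completion call you're using; the prompts are illustrative, not what I actually run):

```python
def pdf_to_markdown(pages, llm):
    """Two-pass sketch: first extract a global outline from a cheap
    skim of every page, then convert page by page with that outline
    as context so heading levels stay consistent document-wide."""
    # Pass 1: global structure from truncated text of each page
    outline = llm(
        "Return a nested outline (title, heading levels) for this document:\n"
        + "\n".join(p[:500] for p in pages)
    )
    # Pass 2: page-by-page conversion, anchored to the shared outline
    md_pages = []
    for i, page in enumerate(pages):
        md_pages.append(llm(
            f"Document outline:\n{outline}\n\n"
            f"Convert page {i + 1} to Markdown, using heading levels "
            f"consistent with the outline:\n{page}"
        ))
    return "\n\n".join(md_pages)
```

Without the shared outline, each page gets converted in isolation and the model happily emits `##` on one page and `###` for the same heading level on the next.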
replies(1): >>43979129 #
3. miki123211 ◴[] No.43976790[source]
There's also #4, reliable OCR and semantics extraction that works across many diverse classes of documents, which is relevant for accessibility.

This is hard because:

1. Unlike a business workflow which often only deals with a few specific kinds of documents, you never know what the user is going to get. You're making an abstract PDF reader, not an app that can process court documents in bankruptcy cases in Delaware.

2. You don't just need the text (like in traditional OCR), you need to recognize tables, page headers and footers, footnotes, headings, mathematics etc.

3. Because this is for human consumption, you want to minimize errors as much as possible, which means not using OCR when it isn't needed, and instead relying on the text embedded within the PDF while still extracting semantics. This means you essentially need two different paths: one for when the PDF only consists of images, and one for when there are content streams you can get some information from.

3.1. But the content streams may contain different text from what's actually on the page, e.g. white-on-white text to hide information the user isn't supposed to see, or diacritic emulation with commands that manually draw acute accents instead of using proper Unicode diacritics (LaTeX works that way).

4. You're likely running as a local app on the user's (possibly very underpowered) device, and likely don't have an associated server and subscription, so you can't use any cloud AI models.

5. You need to support forms. Since the user is using accessibility software, presumably they can't print and use a pen, so you need to handle the ones meant for printing too, not just the nice, spec-compatible ones.
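Point 3.1 is why you can't blindly trust content streams. A minimal heuristic, operating on the kind of span dicts a PDF library hands you (PyMuPDF's `page.get_text("dict")` returns something similar; the structure here is simplified and the thresholds are made up):

```python
def is_suspicious_span(span, page_bg=(255, 255, 255)):
    """Flag text likely invisible to a sighted reader: fill color
    (nearly) identical to the page background. This deliberately
    ignores overlapping images, clipping paths, and text render
    modes, all of which also hide text in real PDFs."""
    r, g, b = span["color"]
    br, bg_, bb = page_bg
    # Euclidean distance in RGB space; small distance ~ invisible text
    dist = ((r - br) ** 2 + (g - bg_) ** 2 + (b - bb) ** 2) ** 0.5
    return dist < 16

def visible_text(spans):
    """Keep only spans a sighted reader would actually see."""
    return " ".join(s["text"] for s in spans if not is_suspicious_span(s))
```

Even this toy version catches the white-on-white case; a real implementation also has to check what's painted *underneath* the span, since white text on a dark rectangle is perfectly visible.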

This is very much an open problem and is not even remotely close to being solved. People have been taking stabs at it for years, but all current solutions suck in some way, and there's no single one that solves all 5 points correctly.

4. noosphr ◴[] No.43977158[source]
>replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.

As someone who had to build custom tools because VLMs are so unreliable: anyone who uses VLMs on unprocessed images is in for more pain than all the providers that let LLMs interact directly with consumers without guard rails.

They are very good at image labeling. They are OK at very simple documents, e.g. single-column text, a single centered level of headings, one image or table per page, etc. (which is what all the MVP demos show). They need another trillion parameters to become merely bad at complex documents with tables and images.

Right now they hallucinate so badly that you simply _can't_ use them for something as simple as a table with a heading at the top, data in the middle and a summary at the bottom.
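One cheap guard rail for exactly that header/data/summary case: when the extracted table has a totals row, recompute it from the extracted body and reject the extraction on mismatch. A sketch, not a full validator:

```python
def check_table(rows, totals, tol=0.01):
    """Recompute each numeric column of `rows` and compare against
    the VLM-extracted `totals` row. A mismatch usually means the
    model hallucinated or dropped cells, so the table should go to
    human review instead of straight into prod."""
    errors = []
    for col, claimed in totals.items():
        computed = sum(r.get(col, 0) for r in rows)
        if abs(computed - claimed) > tol:
            errors.append((col, computed, claimed))
    return errors  # empty list == internally consistent
```

This only catches tables that happen to carry their own checksum (a summary row), but in practice that's a large fraction of financial documents, and it's one of the few hallucination checks that needs no second model call.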

replies(1): >>43978998 #
5. th0ma5 ◴[] No.43978998[source]
I wish I could upvote you more. The compounding errors of these document solutions preclude what people assume must be possible.
6. dstryr ◴[] No.43979129[source]
Give this project a try. I've been using it with promising results.

https://github.com/matthsena/AlcheMark

replies(2): >>43981025 #>>43984987 #
7. aorth ◴[] No.43981025{3}[source]
I tried with one PDF and was surprised to see it connect to some cloud service:

  2025-05-14 07:58:49,373 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
  2025-05-14 07:58:50,446 - urllib3.connectionpool - DEBUG - https://openaipublic.blob.core.windows.net:443 "GET /encodings/o200k_base.tiktoken HTTP/1.1" 200 361 3922
The project's README doesn't mention that anywhere...
replies(1): >>43981478 #
8. degamad ◴[] No.43981478{4}[source]
The project's README mentions that it uses tiktoken[0], which is a separate project created by OpenAI.

tiktoken downloads its token models the first time you use them, but its README doesn't mention that either. It does cache the models, so you shouldn't see more of those connections, if I'm understanding the code correctly.

[0] <https://github.com/openai/tiktoken>
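If I'm reading it right, tiktoken keys the cache on a hash of the download URL and honors a `TIKTOKEN_CACHE_DIR` env var, so you can pre-warm the cache (e.g. in CI) and avoid runtime network access entirely. Roughly the lookup logic, as a simplified stdlib reimplementation (not tiktoken's actual code):

```python
import hashlib
import os
import tempfile
import urllib.request

def cached_fetch(url, cache_dir=None):
    """Download `url` once; later calls read from disk. Mirrors the
    behavior seen above: only the first use hits the network, after
    which the cached copy on disk is served."""
    cache_dir = cache_dir or os.environ.get(
        "TIKTOKEN_CACHE_DIR",
        os.path.join(tempfile.gettempdir(), "token-cache"))
    os.makedirs(cache_dir, exist_ok=True)
    # cache filename derived from the URL, so each encoding gets its own file
    path = os.path.join(cache_dir, hashlib.sha1(url.encode()).hexdigest())
    if not os.path.exists(path):  # only the first call downloads
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
        with open(path, "wb") as f:
            f.write(data)
    with open(path, "rb") as f:
        return f.read()
```

So the surprise connection in the parent comment should happen exactly once per encoding per cache directory.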

9. varunneal ◴[] No.43984987{3}[source]
I'll check it out!