PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 5 comments | 13 May 25 15:01 UTC

kbyatnal ◴[13 May 25 18:00 UTC] No.43975807[source]▶

>>43973721 (OP) #

"PDF to Text" is a bit simplified IMO. There are actually a few classes of problems within this category:

1. reliable OCR from documents (to index for search, feed into a vector DB, etc)

2. structured data extraction (pull out targeted values)

3. end-to-end document pipelines (e.g. automate mortgage applications)

Marginalia needs to solve problem #1 (OCR), which luckily is getting commoditized by the day thanks to models like Gemini Flash. I've now seen multiple companies replace their OCR pipelines with Flash for a fraction of the cost of previous solutions; it's really quite remarkable.

Problems #2 and #3 are much trickier. There's still a large gap for businesses in going from raw OCR outputs to document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. The future is definitely moving in this direction though.
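To make the classify -> split -> extract flow and the human-in-the-loop step concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the keyword "classifier", the `Total:` extractor, and the 0.8 confidence threshold are toy stand-ins for real model calls, not anyone's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    field: str
    value: str
    confidence: float  # 0.0 to 1.0; a real system would get this from a model


def classify(page_text: str) -> str:
    """Toy classifier (hypothetical): a stand-in for a model call."""
    return "invoice" if "invoice" in page_text.lower() else "other"


def split(pages: list[str]) -> list[tuple[str, list[str]]]:
    """Group consecutive pages that share a class into sub-documents."""
    docs: list[tuple[str, list[str]]] = []
    for page in pages:
        label = classify(page)
        if docs and docs[-1][0] == label:
            docs[-1][1].append(page)
        else:
            docs.append((label, [page]))
    return docs


def extract(label: str, pages: list[str]) -> list[Extraction]:
    """Toy extractor (hypothetical): pulls 'Total: X' lines from each page."""
    results = []
    for page in pages:
        for line in page.splitlines():
            if line.lower().startswith("total:"):
                value = line.split(":", 1)[1].strip()
                # Low confidence when the value does not look numeric
                # (e.g. OCR read the letter O instead of a zero).
                looks_numeric = value.replace(".", "", 1).isdigit()
                results.append(Extraction("total", value, 0.95 if looks_numeric else 0.40))
    return results


def run_pipeline(pages: list[str], threshold: float = 0.8):
    """Return (accepted, needs_review); low-confidence values go to a human."""
    accepted, needs_review = [], []
    for label, doc_pages in split(pages):
        for item in extract(label, doc_pages):
            (accepted if item.confidence >= threshold else needs_review).append(item)
    return accepted, needs_review
```

The point of the sketch is the routing at the end: anything below the threshold lands in a review queue instead of flowing straight into production, and corrected values from that queue are what you would later use as labeled data.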

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.ai)

replies(3): >>43976203 #>>43976790 #>>43977158 #

1. varunneal ◴[13 May 25 18:40 UTC] No.43976203[source]▶

>>43975807 #

I've been hacking away at trying to process PDFs into Markdown, having encountered similar obstacles to OP regarding header detection (and many other issues). OCR is fantastic these days but maintaining a global structure to the document is much trickier. Consistent HTML still seems out of reach for large documents. I'm having half-decent results with Markdown using multiple passes of an LLM to extract document structure and feeding it in contextually for page-by-page extraction.
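Roughly, the multi-pass idea can be sketched like this in Python. This is only an illustration under assumptions: `llm` is a hypothetical callable (prompt in, completion out) that you would wire to whatever model you use, and the prompts and function names are made up for the example.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def extract_outline(pages: list[str], llm: LLM) -> list[str]:
    """Pass 1: collect headings page by page into one document-wide outline."""
    outline: list[str] = []
    for page in pages:
        reply = llm("List the section headings on this page, one per line:\n" + page)
        for heading in reply.splitlines():
            heading = heading.strip()
            if heading and heading not in outline:
                outline.append(heading)
    return outline


def pages_to_markdown(pages: list[str], llm: LLM) -> str:
    """Pass 2: convert each page with the global outline supplied as context."""
    outline = extract_outline(pages, llm)
    context = "Document outline:\n" + "\n".join(f"- {h}" for h in outline)
    chunks = []
    for page in pages:
        prompt = (
            f"{context}\n\n"
            "Convert this page to Markdown, using the outline to choose heading levels:\n"
            f"{page}"
        )
        chunks.append(llm(prompt))
    return "\n\n".join(chunks)
```

Each page is still converted on its own, so the context window stays small, but every call sees the same outline, which is what keeps heading levels consistent across pages.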

replies(1): >>43979129 #

2. dstryr ◴[13 May 25 23:44 UTC] No.43979129[source]▶

>>43976203 (TP) #

Give this project a try. I've been using it with promising results.

https://github.com/matthsena/AlcheMark

replies(2): >>43981025 #>>43984987 #

3. aorth ◴[14 May 25 05:05 UTC] No.43981025[source]▶

>>43979129 #

I tried with one PDF and was surprised to see it connect to some cloud service:

  2025-05-14 07:58:49,373 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
  2025-05-14 07:58:50,446 - urllib3.connectionpool - DEBUG - https://openaipublic.blob.core.windows.net:443 "GET /encodings/o200k_base.tiktoken HTTP/1.1" 200 361 3922

The project's README doesn't mention that anywhere...

replies(1): >>43981478 #

4. degamad ◴[14 May 25 06:26 UTC] No.43981478{3}[source]▶

>>43981025 #

The project's README mentions that it uses tiktoken[0], which is a separate project created by OpenAI.

tiktoken downloads token models the first time you use them, but it does not mention that. It does cache the models, so you shouldn't see more of those connections, if I'm understanding the code correctly.
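For anyone unfamiliar with the pattern, here is a generic Python sketch of "download once, then read from a local cache". To be clear, this is not tiktoken's actual code: `read_cached`, the `fetch` callable, and the cache layout are hypothetical names and choices made up for illustration.

```python
import hashlib
import os
from typing import Callable


def read_cached(url: str, fetch: Callable[[str], bytes], cache_dir: str) -> bytes:
    """Return the bytes for url, calling fetch only if no cached copy exists."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hypothetical cache layout: one file per URL, named by a hash of the URL.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)

    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()  # cache hit: no network access

    data = fetch(url)  # cache miss: the only place a connection is made
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)  # atomic rename so readers never see a partial file
    return data
```

With a scheme like this, the HTTPS connection shows up in the logs on the first run only; later runs read the file from disk.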

[0] <https://github.com/openai/tiktoken>

5. varunneal ◴[14 May 25 14:29 UTC] No.43984987[source]▶

>>43979129 #

I'll check it out!
