PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC | HN request time: 0s | source

Show context

kbyatnal ◴[13 May 25 18:00 UTC] No.43975807[source]▶

>>43973721 (OP) #

"PDF to Text" is a bit simplified IMO. There's actually a few class of problems within this category:

1. reliable OCR from documents (to index for search, feed into a vector DB, etc)

2. structured data extraction (pull out targeted values)

3. end-to-end document pipelines (e.g. automate mortgage applications)

Marginalia needs to solve problem #1 (OCR), which is luckily getting commoditized by the day thanks to models like Gemini Flash. I've now seen multiple companies replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.

Problems #2 and #3 are much more tricky. There's still a large gap for businesses in going from raw OCR outputs —> document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. The future is definitely moving in this direction though.

Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.ai)

replies(3): >>43976203 #>>43976790 #>>43977158 #

noosphr ◴[13 May 25 20:10 UTC] No.43977158[source]▶

>>43975807 #

>replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.

As someone who had to build custom tools because VLMs are so unreliable: anyone that uses VLMs for unprocessed images is in for more pain than all the providers which let LLMs without guard rails interact directly with consumers.

They are very good at image labeling. They are ok at very simple documents, e.g. single column text, centered single level of headings, one image or table per page, etc. (which is what all the MVP demos show). They need another trillion parameters to become bad at complex documents with tables and images.

Right now they hallucinate so badly that you simply _can't_ use them for something as simple as a table with a heading at the top, data in the middle and a summary at the bottom.

replies(1): >>43978998 #

1. th0ma5 ◴[13 May 25 23:26 UTC] No.43978998[source]▶

>>43977158 #

I wish I could upvote you more. The compounding errors of these document solutions preclude what people assume must be possible.

↑