PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 4 comments | 13 May 25 15:01 UTC | HN request time: 0s | source

Show context

noosphr ◴[13 May 25 20:06 UTC] No.43977112[source]▶

I've worked on this in my day job: extracting _all_ relevant information from a financial services PDF for a bert based search engine.

The only way to solve that is with a segmentation model followed by a regular OCR model and whatever other specialized models you need to extract other types of data. VLM aren't ready for prime time and won't be for a decade on more.

What worked was using doclaynet trained YOLO models to get the areas of the document that were text, images, tables or formulas: https://github.com/DS4SD/DocLayNet if you don't care about anything but text you can feed the results into tesseract directly (but for the love of god read the manual). Congratulations, you're done.

Here's some pre-trained models that work OK out of the box: https://github.com/ppaanngggg/yolo-doclaynet I found that we needed to increase the resolution from ~700px to ~2100px horizontal for financial data segmentation.

VLMs on the other hand still choke on long text and hallucinate unpredictably. Worse they can't understand nested data. If you give _any_ current model nothing harder than three nested rectangles with text under each they will not extract the text correctly. Given that nested rectangles describes every table no VLM can currently extract data from anything but the most straightforward of tables. But it will happily lie to you that it did - after all a mining company should own a dozen bulldozers right? And if they each cost $35.000 it must be an amazing deal they got, right?

replies(2): >>43978106 #>>43982262 #

cess11 ◴[14 May 25 08:35 UTC] No.43982262[source]▶

>>43977112 #

That looks like a pretty good starting point, thanks. I've been dabbling in vision models but need a much higher degree of accuracy than they seem able to provide, opting instead for more traditional techniques and handling errors manually.

replies(1): >>43982432 #

1. noosphr ◴[14 May 25 09:01 UTC] No.43982432[source]▶

>>43982262 #

For non-table documents a fine tuned yolov8 + tesseract with _good_ image pre-processing has basically a zero percent error rate on monolingual texts. I say basically because the training data has worse labels than what the multi-model system gives out in the cases that I double checked manually.

But no one reads the manual on tesseract and everyone ends up feeding it garbage, with predictable results.

Tables are an open research problem.

We started training a custom version of this model: https://arxiv.org/pdf/2309.14962 but there wasn't the business case since the bert search model dealt well enough with the word soup that came out of easy ocr. If you're interested drop a line. I'd love to get a model like that trained since it's very low hanging fruit that no one has done right.

replies(2): >>43983203 #>>44013138 #

2. cess11 ◴[14 May 25 11:29 UTC] No.43983203[source]▶

>>43982432 (TP) #

Thanks, that's interesting research, I'll look into it.

3. opyate ◴[17 May 25 09:27 UTC] No.44013138[source]▶

>>43982432 (TP) #

The first thing I did when I saw this thread was ctrl-f for doclaynet :)

I've been at this problem since 2013, and a few years ago turned my findings into more of a consultancy than a product. See HTTPS://pdfcrun.ch

However, due to various events, I burned out recently and took a permie job, so would love to stick my head in the sand and play video games in my spare time, but secretly hoping you'd see this and to hear about your work.

replies(1): >>44014488 #

4. noosphr ◴[17 May 25 14:15 UTC] No.44014488[source]▶

>>44013138 #

There's not much to say.

Doclaynet is the easy part and with triple the usual resolution the previous gen of yolo models have solved document segmentation for every document I've looked at.

The hard part is the table segmentation. I don't have the budget to do a proper exploration of hyper parameters for the gridformer models before starting a $50,000 training run.

This is a back burner project along with speaker diarization. I have no idea why those haven't been solved since they are very low hanging fruit that would release tens of millions in productivity when deployed at scale, but regardless I can't justify buying a Nvidia DGX H200 and spending two months exploring architectures for each.

↑