
357 points by ingve | 1 comment
noosphr ◴[] No.43977112[source]
I've worked on this in my day job: extracting _all_ relevant information from financial-services PDFs for a BERT-based search engine.

The only way to solve that today is with a segmentation model followed by a regular OCR model, plus whatever other specialized models you need to extract other types of data. VLMs aren't ready for prime time and won't be for a decade or more.

What worked was using DocLayNet-trained YOLO models to find the regions of the document that were text, images, tables, or formulas: https://github.com/DS4SD/DocLayNet. If you don't care about anything but text, you can feed the results into Tesseract directly (but for the love of god, read the manual). Congratulations, you're done.

Here are some pre-trained models that work OK out of the box: https://github.com/ppaanngggg/yolo-doclaynet. I found that we needed to increase the horizontal resolution from ~700px to ~2100px for financial-data segmentation.
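The segmentation-then-OCR handoff can be sketched roughly as below. This is a minimal sketch, not the poster's code: it assumes the `ultralytics` and `pytesseract` packages, and the weights filename is a placeholder for a DocLayNet-trained checkpoint like those linked above. The reading-order helper is plain Python.

```python
def sort_reading_order(boxes, line_tol=20):
    """Order (x1, y1, x2, y2) boxes top-to-bottom, then left-to-right.

    Boxes whose top edges are within `line_tol` px are treated as one row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        for row in rows:
            if abs(row[0][1] - box[1]) <= line_tol:
                row.append(box)
                break
        else:
            rows.append([box])
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


def ocr_text_regions(image_path, weights="yolov8-doclaynet.pt"):
    """Segment a page with a layout model, then OCR only the text regions.

    `weights` is a placeholder path to a DocLayNet-trained checkpoint.
    """
    from ultralytics import YOLO
    import pytesseract
    from PIL import Image

    model = YOLO(weights)
    result = model(image_path, imgsz=2016)[0]  # well above the 640 default
    img = Image.open(image_path)
    text_boxes = [
        tuple(map(int, b.xyxy[0].tolist()))
        for b in result.boxes
        if result.names[int(b.cls)] == "Text"  # DocLayNet class label
    ]
    return "\n".join(
        pytesseract.image_to_string(img.crop(box))
        for box in sort_reading_order(text_boxes)
    )
```

The key design point is that Tesseract only ever sees clean, single-purpose crops in a sensible order, never the whole mixed-layout page.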

VLMs, on the other hand, still choke on long text and hallucinate unpredictably. Worse, they can't understand nested data. Give _any_ current model nothing harder than three nested rectangles with text under each and it will not extract the text correctly. Since nested rectangles describe every table, no VLM can currently extract data from anything but the most straightforward of tables. But it will happily lie to you that it did - after all, a mining company should own a dozen bulldozers, right? And if they each cost $35.000, it must be an amazing deal they got, right?
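To make the nested-rectangle point concrete, here is an illustration (mine, not from the thread): a spanning table header is a rectangle containing child rectangles with their own text, and correct extraction must key each data cell by its full header path, not just the leaf label. The example labels are invented.

```python
# A (label, children) tree standing in for nested header rectangles:
# one spanning header containing two sub-headers.
header_tree = ("Equipment cost", [("Bulldozers", []), ("Excavators", [])])


def header_paths(node, prefix=()):
    """Flatten a (label, children) header tree into leaf header paths."""
    label, children = node
    path = prefix + (label,)
    if not children:
        return [path]
    return [p for child in children for p in header_paths(child, path)]


# Correct extraction keys each column by its full path:
# [('Equipment cost', 'Bulldozers'), ('Equipment cost', 'Excavators')]
# Dropping the outer rectangle - what VLMs tend to do - loses the fact
# that both columns are costs, not counts.
```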

replies(2): >>43978106 #>>43982262 #
cess11 ◴[] No.43982262[source]
That looks like a pretty good starting point, thanks. I've been dabbling in vision models, but I need a much higher degree of accuracy than they seem able to provide, so I've opted instead for more traditional techniques and handle errors manually.
replies(1): >>43982432 #
noosphr ◴[] No.43982432[source]
For non-table documents, a fine-tuned YOLOv8 + Tesseract with _good_ image pre-processing has basically a zero percent error rate on monolingual texts. I say basically because the training data has worse labels than what the multi-model system puts out in the cases I double-checked manually.

But no one reads the manual on Tesseract, and everyone ends up feeding it garbage, with predictable results.

Tables are an open research problem.

We started training a custom version of this model: https://arxiv.org/pdf/2309.14962, but there wasn't a business case, since the BERT search model dealt well enough with the word soup that came out of EasyOCR. If you're interested, drop me a line. I'd love to get a model like that trained, since it's very low-hanging fruit that no one has done right.

replies(2): >>43983203 #>>44013138 #
cess11 ◴[] No.43983203{3}[source]
Thanks, that's interesting research, I'll look into it.