PDF to Text, a challenging problem

1. xnx ◴[13 May 25 15:44 UTC] No.43974208[source]▶

Weird that there's no mention of LLMs in this article even though the article is very recent. LLMs haven't solved every OCR/document data extraction problem, but they've dramatically improved the situation.

replies(5): >>43974229 #>>43974325 #>>43974337 #>>43974562 #>>43975686 #

2. j45 ◴[13 May 25 15:46 UTC] No.43974229[source]▶

>>43974208 (TP) #

LLMs are definitely helping approach some problems that couldn't be to date.

3. simonw ◴[13 May 25 15:56 UTC] No.43974325[source]▶

>>43974208 (TP) #

I've had great results against PDFs from recent vision models. Gemini, OpenAI and Claude can all accept PDFs directly now and treat them as image input.

For longer PDFs I've found that breaking them up into images per page and treating each page separately works well - feeing a thousand page PDF to even a long context model like Gemini 2.5 Pro or Flash still isn't reliable enough that I trust it.

As always though, the big challenge of using vision LLMs for OCR (or audio transcription) tasks is the risk of accidental instruction following - even more so if there's a risk of deliberately malicious instructions in the documents you are processing.

4. marginalia_nu ◴[13 May 25 15:57 UTC] No.43974337[source]▶

>>43974208 (TP) #

Author here: LLMs are definitely the new gold standard for smaller bodies of shorter documents.

The article is in the context of an internet search engine, the corpus to be converted is of order 1 TB. Running that amount of data through an LLM would be extremely expensive, given the relatively marginal improvement in outcome.

replies(2): >>43974639 #>>43977353 #

5. ◴[13 May 25 16:16 UTC] No.43974562[source]▶

>>43974208 (TP) #

6. mediaman ◴[13 May 25 16:23 UTC] No.43974639[source]▶

>>43974337 #

Corpus size doesn't mean much in the context of a PDF, given how variable that can be per page.

I've found Google's Flash to cut my OCR costs by about 95+% compared to traditional commercial offerings that support structured data extraction, and I still get tables, headers, etc from each page. Still not perfect, but per page costs were less than one tenth of a cent per page, and 100 gb collections of PDFs ran to a few hundreds of dollars.

7. constantinum ◴[13 May 25 17:50 UTC] No.43975686[source]▶

>>43974208 (TP) #

True indeed, but there are a few problems — hallucinations and trusting the output(validation). More here https://unstract.com/blog/why-llms-struggle-with-unstructure...

8. noosphr ◴[13 May 25 20:30 UTC] No.43977353[source]▶

>>43974337 #

A PDF corpus with a size of 1tb can mean anything from 10,000 really poorly scanned documents to 1,000,000,000 nicely generated latex pdfs. What matters is the number of documents, and the number of pages per document.

For the first I can run a segmentation model + traditional OCR in a day or two for the cost of warming my office in winter. For the second you'd need a few hundred dollars and a cloud server.

Feel free to reach out. I'd be happy to have a chat and do some pro-bono work for someone building a open source tool chain and index for the rest of us.