357 points ingve | 1 comment
xnx ◴[] No.43974208[source]
Weird that there's no mention of LLMs in this article even though the article is very recent. LLMs haven't solved every OCR/document data extraction problem, but they've dramatically improved the situation.
replies(5): >>43974229 #>>43974325 #>>43974337 #>>43974562 #>>43975686 #
1. simonw ◴[] No.43974325[source]
I've had great results against PDFs from recent vision models. Gemini, OpenAI and Claude can all accept PDFs directly now and treat them as image input.

For longer PDFs I've found that breaking them up into images per page and treating each page separately works well - feeding a thousand-page PDF to even a long context model like Gemini 2.5 Pro or Flash still isn't reliable enough that I trust it.
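
A minimal sketch of that per-page loop, assuming the PDF has already been rendered to one image per page. `transcribe_pages` and the `transcribe_page` callable are hypothetical names for illustration; the callable stands in for whatever vision model API you use, and nothing here is a real client library.

```python
from typing import Callable, Iterable, List, Tuple


def transcribe_pages(
    page_images: Iterable[bytes],
    transcribe_page: Callable[[bytes], str],
) -> List[Tuple[int, str]]:
    """Send each page image to the model separately.

    `transcribe_page` is a hypothetical callable wrapping a single
    vision-model request. Returns (page_number, text) pairs, 1-indexed.
    """
    results: List[Tuple[int, str]] = []
    for page_number, image in enumerate(page_images, start=1):
        # One request per page keeps every call small, so a bad
        # transcription is confined to a single page.
        text = transcribe_page(image)
        results.append((page_number, text))
    return results
```

Keeping the page number next to each result makes it easy to re-run or spot-check individual pages instead of the whole document.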

As always though, the big challenge of using vision LLMs for OCR (or audio transcription) tasks is the risk of accidental instruction following - even more so if there's a risk of deliberately malicious instructions in the documents you are processing.
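
One crude way to at least notice some of these cases is to flag transcribed pages that contain instruction-like phrases and send them for human review. This is only a sketch: the phrase list and the `looks_like_injection` name are assumptions made up for illustration, and the check cannot catch a page where the model silently obeyed the instruction instead of transcribing it.

```python
import re

# Hypothetical, deliberately short phrase list. A real document set
# would need its own tuning, and no list like this is complete.
SUSPICIOUS = re.compile(
    r"ignore (all |any )?(previous|prior|above) instructions"
    r"|disregard (the )?(previous|above)"
    r"|system prompt",
    re.IGNORECASE,
)


def looks_like_injection(transcribed_text: str) -> bool:
    """Return True if the text contains an instruction-like phrase."""
    return bool(SUSPICIOUS.search(transcribed_text))
```

A flagged page is not proof of an attack, and an unflagged page is not proof of safety; the flag is just a cheap signal for where to look first.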