
1303 points serjester | 1 comment
nickandbro ◴[] No.42953976[source]
I think very soon a new model will destroy whatever startups and services are built around document ingestion: a model that can take in a PDF page as an image and transcribe it to text with near-perfect accuracy.
replies(2): >>42954513 #>>42955074 #
layer8 ◴[] No.42955074[source]
Extracting plain text isn’t that much of a problem, relatively speaking. It’s interpreting more complex elements like nested lists, tables, side bars, footnotes/endnotes, cross-references, images and diagrams where things get challenging.
replies(1): >>42959407 #
1. visarga ◴[] No.42959407[source]
OCR is not 100% accurate either. Reading order is also fragile: it might recognize every word correctly but still mess up the line structure.
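
To make the reading-order point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Word` tuple, the coordinates, and the `y_tol` tolerance are made up and do not come from any particular OCR engine. It shows how words that are each recognized correctly can still come out in the wrong order when sorted naively by position, and how grouping into lines with a vertical tolerance helps for simple single-column text.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical OCR output: recognized text plus a position on the page."""
    text: str
    x: float  # left edge of the word box
    y: float  # vertical center of the word box


def naive_order(words):
    """Sort strictly by (y, x). Tiny baseline jitter reorders words in a line."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def line_grouped_order(words, y_tol=5.0):
    """Group words into lines when their vertical centers are within y_tol
    of the previously added word, then sort each line left to right.
    The tolerance is an illustrative guess, not a recommended value."""
    lines = []
    for w in sorted(words, key=lambda w: w.y):
        if lines and abs(w.y - lines[-1][-1].y) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [[w.text for w in sorted(line, key=lambda w: w.x)] for line in lines]


# Two lines of text with slight vertical jitter, as from a skewed scan.
words = [
    Word("Reading", 10, 100.0),
    Word("order", 80, 98.5),
    Word("is", 130, 101.2),
    Word("fragile", 10, 120.4),
    Word("here", 80, 119.0),
]

print(naive_order(words))         # words are all correct, order is not
print(line_grouped_order(words))  # lines recovered for this simple case
```

Even the grouped version is only a heuristic: it assumes a single column and roughly horizontal baselines, so multi-column pages, sidebars, or tables would still come out interleaved.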