(www.sergey.fyi)

1303 points serjester | 5 comments | 05 Feb 25 18:05 UTC

1. nickandbro ◴[05 Feb 25 19:41 UTC] No.42953976[source]▶

I think very soon a new model will destroy whatever startups and services are built around document ingestion: a model that can take in a PDF page as an image and transcribe it to text with near-perfect accuracy.

replies(2): >>42954513 #>>42955074 #

2. depr ◴[05 Feb 25 20:18 UTC] No.42954513[source]▶

>>42953976 (TP) #

I think Azure Document Intelligence, Google Document AI and Amazon Textract are among the best services, if not the best, though, and they already offer these models.

replies(1): >>42959514 #

3. layer8 ◴[05 Feb 25 21:02 UTC] No.42955074[source]▶

>>42953976 (TP) #

Extracting plain text isn’t that much of a problem, relatively speaking. It’s interpreting more complex elements like nested lists, tables, sidebars, footnotes/endnotes, cross-references, images and diagrams where things get challenging.

replies(1): >>42959407 #

4. visarga ◴[06 Feb 25 05:33 UTC] No.42959407[source]▶

>>42955074 #

OCR is not 100% accurate either. Reading order is also fragile: it might OCR the words correctly but mess up the line structure.

5. nnurmanov ◴[06 Feb 25 06:00 UTC] No.42959514[source]▶

>>42954513 #

I have not tested Azure Document Intelligence or Google Document AI, but AWS Textract, LlamaParse, Unstructured and Omni made it onto my shortlist. I have not tested Docling, as I could not install it on my Windows laptop.

Ingesting PDFs and why Gemini 2.0 changes everything