(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC


xnx [13 May 25 15:44 UTC] No.43974208

Weird that there's no mention of LLMs in this article even though the article is very recent. LLMs haven't solved every OCR/document data extraction problem, but they've dramatically improved the situation.


marginalia_nu [13 May 25 15:57 UTC] No.43974337

>>43974208 #

Author here: LLMs are definitely the new gold standard for smaller bodies of shorter documents.

The article is written in the context of an internet search engine, where the corpus to be converted is on the order of 1 TB. Running that amount of data through an LLM would be extremely expensive, given the relatively marginal improvement in outcome.


1. mediaman [13 May 25 16:23 UTC] No.43974639

>>43974337 #

Corpus size in bytes doesn't tell you much for PDFs, given how much the size per page can vary.

I've found Google's Flash to cut my OCR costs by 95% or more compared to traditional commercial offerings that support structured data extraction, and I still get tables, headers, etc. from each page. It's still not perfect, but costs were less than one tenth of a cent per page, and 100 GB collections of PDFs ran to a few hundred dollars.
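
To make the arithmetic in this exchange concrete, here is a rough back-of-the-envelope sketch in Python. It is not anyone's actual pipeline: the function name `estimate_ocr_cost` and the average bytes-per-page figures are hypothetical, chosen only to show how strongly the total depends on page size. The one number taken from the comment above is the upper bound of one tenth of a cent per page.

```python
GB = 10**9
TB = 10**12

# Upper bound quoted in the comment: less than one tenth of a cent per page.
COST_PER_PAGE_USD = 0.001


def estimate_ocr_cost(corpus_bytes: int, avg_bytes_per_page: int,
                      cost_per_page_usd: float = COST_PER_PAGE_USD) -> float:
    """Estimate the total cost in USD of a per-page-priced OCR service.

    avg_bytes_per_page is an assumption: PDFs vary enormously in how
    many bytes each page occupies, so the result is only a rough guide.
    """
    if avg_bytes_per_page <= 0:
        raise ValueError("avg_bytes_per_page must be positive")
    pages = corpus_bytes / avg_bytes_per_page
    return pages * cost_per_page_usd


# Hypothetical average page sizes, from text-heavy to scan-heavy PDFs.
for label, corpus in (("100 GB", 100 * GB), ("1 TB", TB)):
    for avg in (50_000, 250_000, 1_000_000):
        cost = estimate_ocr_cost(corpus, avg)
        print(f"{label} at {avg // 1000} KB/page: about ${cost:,.0f}")
```

Under these assumptions, a 100 GB collection at 250 KB per page comes to roughly $400, which is in line with "a few hundred dollars", while the same collection at 50 KB per page costs five times as much. That spread is the point about corpus size being a weak predictor of cost.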


PDF to Text, a challenging problem