Weird that there's no mention of LLMs in this article even though the article is very recent. LLMs haven't solved every OCR/document data extraction problem, but they've dramatically improved the situation.
replies(5):
The article is written in the context of an internet search engine, where the corpus to be converted is on the order of 1 TB. Running that amount of data through an LLM would be extremely expensive, especially given the relatively marginal improvement in outcome.
For the first, I can run a segmentation model + traditional OCR in a day or two for the cost of warming my office in winter. For the second, you'd need a few hundred dollars and a cloud server.
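For anyone who wants to sanity-check the cost argument for a corpus of roughly 1 TB, here is a back-of-the-envelope sketch. Every number in it is a placeholder assumption, not a figure from the article or from any provider's price list: the average page size, the per-page LLM price, and the function names are all made up for illustration. Plug in your own measurements.

```python
def estimate_pages(corpus_bytes: float, avg_bytes_per_page: float) -> float:
    """Rough page count, assuming a uniform average page size."""
    return corpus_bytes / avg_bytes_per_page


def estimate_llm_cost_usd(corpus_bytes: float,
                          avg_bytes_per_page: float,
                          usd_per_page: float) -> float:
    """Cost of sending every page through an LLM once, with no retries."""
    return estimate_pages(corpus_bytes, avg_bytes_per_page) * usd_per_page


if __name__ == "__main__":
    corpus_bytes = 1e12           # ~1 TB, the order of magnitude mentioned above
    avg_bytes_per_page = 100_000  # ASSUMPTION: 100 KB per page, purely illustrative
    usd_per_page = 0.001          # ASSUMPTION: hypothetical price, not a real quote

    pages = estimate_pages(corpus_bytes, avg_bytes_per_page)
    cost = estimate_llm_cost_usd(corpus_bytes, avg_bytes_per_page, usd_per_page)
    print(f"~{pages:,.0f} pages, ~${cost:,.0f} for one pass")
```

The result scales linearly with both placeholders, so being off by 10x on either one moves the estimate by 10x. It also ignores retries, rate limits, and any re-crawling, all of which only push the total up.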
Feel free to reach out. I'd be happy to have a chat and do some pro bono work for someone building an open-source toolchain and index for the rest of us.