
293 points lapnect | 2 comments
AmazingTurtle ◴[] No.42155326[source]
One can run Apache Tika OCR and feed its output, together with the image, into an LLM to fix typos.
replies(1): >>42156396 #
cess11 ◴[] No.42156396[source]
While I'm a fan of Tika, a lot of people get queasy from Java and XML. They might be better served by their preferred scripting language and https://github.com/ocrmypdf/OCRmyPDF, which uses the same OCR engine.
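For the scripting-language route, a minimal sketch in Python that shells out to the `ocrmypdf` command-line tool. It assumes `ocrmypdf` is installed and on `PATH`; the helper names `ocrmypdf_command` and `run_ocr` and the file names are made up for illustration.

```python
import subprocess


def ocrmypdf_command(src: str, dst: str, sidecar: str, language: str = "eng") -> list[str]:
    """Build the argv for one OCRmyPDF run.

    --sidecar writes the recognized text to a separate plain-text file,
    --language selects the OCR language pack.
    """
    return ["ocrmypdf", "--language", language, "--sidecar", sidecar, src, dst]


def run_ocr(src: str, dst: str, sidecar: str, language: str = "eng") -> str:
    """Run OCRmyPDF and return the recognized text from the sidecar file."""
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(ocrmypdf_command(src, dst, sidecar, language), check=True)
    with open(sidecar, encoding="utf-8") as f:
        return f.read()


# Example (hypothetical file names):
# text = run_ocr("scan.pdf", "scan_ocr.pdf", "scan.txt")
```

The sidecar text is what you would hand to a later clean-up step; the output PDF keeps the original image with a searchable text layer.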
replies(1): >>42163052 #
1. AmazingTurtle ◴[] No.42163052[source]
May I introduce you to `apache/tika:2.9.2.1-full`, a Docker image that serves a REST API on port 9998.
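A rough sketch of calling that REST API from Python with only the standard library. It assumes the container was started with something like `docker run -d -p 9998:9998 apache/tika:2.9.2.1-full`, so the server is reachable at `http://localhost:9998`; the helper names and the file name are hypothetical.

```python
import urllib.request

# Assumption: the container's port 9998 is published on localhost.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks Tika Server for plain-text output."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            # Ask for extracted text rather than XHTML.
            "Accept": "text/plain",
            # Generic type, so the server detects the format from the bytes.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send a file to a running Tika Server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8")


# Example (requires a running server; file name is hypothetical):
# print(extract_text("scan.png"))
```

No Java or Maven is needed on the client side; anything that can send an HTTP PUT can use the server.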
replies(1): >>42163432 #
2. cess11 ◴[] No.42163432[source]
Not sure what you mean. Are they making Graal builds you can run standalone now? I only use Tika through Maven at work, so I might not be up to date on what happens in the project.