293 points lapnect | 10 comments
1. nutlope ◴[] No.42155007[source]
Hi all, I'm the author of llama-ocr. Thank you for sharing & for the kind comments! I built this earlier this week since I wanted a simple API to do OCR – it uses Llama 3.2 Vision (hosted on together.ai, where I work) to parse images into structured markdown. I also have it available as an npm package.

Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, etc. If anyone has any questions, feel free to send them and I'll try to respond!

replies(5): >>42155235 #>>42155376 #>>42155942 #>>42158372 #>>42159434 #
2. Curiositry ◴[] No.42155235[source]
Option to use a local LLM?
replies(1): >>42155548 #
3. nh2 ◴[] No.42155376[source]
I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.

Is this degree of transformation expected/desirable?

(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)
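
One way to cope with the mixed output on the consuming side is to normalize both shapes into the same row structure before further processing. The following is a minimal sketch, not part of llama-ocr: the function name `extract_rows` is hypothetical, and it assumes the model emits either plain markdown bullets or a pipe table.

```python
import re


def extract_rows(markdown: str) -> list[list[str]]:
    """Return one list of cell strings per bullet item or table row."""
    rows = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Bullet item ("- text", "* text", "+ text") becomes a single-cell row.
        bullet = re.match(r"^[-*+]\s+(.*)$", stripped)
        if bullet:
            rows.append([bullet.group(1).strip()])
            continue
        # Pipe-table row ("| a | b |") becomes one cell per column.
        if stripped.startswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            # Skip the header separator row, e.g. "| --- | :---: |".
            if all(re.fullmatch(r":?-+:?", c) for c in cells):
                continue
            rows.append(cells)
    return rows


bullets = "- Coffee 3.50\n- Coffee 3.50\n- Coffee 3.50"
table = "| Item | Quantity | Price |\n| --- | --- | --- |\n| Coffee | 3 | 3.50 |"
print(extract_rows(bullets))
print(extract_rows(table))
```

The table header is returned as the first row, so a caller that wants only data rows has to drop it. This does not undo the merged "quantity" column; it only gives downstream code a single structure to work with.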

replies(1): >>42156858 #
4. Eisenstein ◴[] No.42155548[source]
I made a script which does exactly the same thing but locally using koboldcpp for inference. It downloads MiniCPM-V 2.6 with image projector the first time you run it. If you want to use a different model you can, but you will want to edit the instruct template to match.

* https://github.com/jabberjabberjabber/LLMOCR

replies(1): >>42155615 #
5. nirav72 ◴[] No.42155615{3}[source]
MiniCPM-V 2.6 is probably the best self-hosted vision model I have used so far. Not just for OCR, but also for image analysis. I have it set up so that my NVR (Frigate) sends a couple of images, upon a motion alert from a driveway security camera, to Ollama with MiniCPM-V 2.6. I'm able to get a reasonably accurate description of the vehicle that pulled into the driveway, including a description of the person who exits the vehicle and the license plate. All sent to my phone.
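For readers curious what the Ollama side of such a setup can look like, here is a rough sketch. It is not the poster's code and does not cover the Frigate or phone-notification parts. It assumes a local Ollama server on its default port and a model pulled under the tag `minicpm-v`; the helper names `build_payload` and `describe` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(images: list[bytes], prompt: str, model: str = "minicpm-v") -> dict:
    """Build a non-streaming generate request with base64-encoded images."""
    return {
        "model": model,  # assumed model tag; adjust to whatever is installed
        "prompt": prompt,
        "images": [base64.b64encode(img).decode("ascii") for img in images],
        "stream": False,
    }


def describe(images: list[bytes], prompt: str, url: str = OLLAMA_URL) -> str:
    """Send the images to Ollama and return the generated description."""
    body = json.dumps(build_payload(images, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

A caller would read the snapshot files as bytes and pass them in, for example `describe([open("snapshot.jpg", "rb").read()], "Describe the vehicle and read the license plate.")`, where the file name and prompt are placeholders.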
replies(1): >>42163822 #
6. Szpadel ◴[] No.42155942[source]
> Need an example image? Try ours.

Great idea, I wish more services had a similar feature.
7. zainia ◴[] No.42156858[source]
Here's the prompt being used, tweaking that might help: https://github.com/Nutlope/llama-ocr/blob/main/src/index.ts#...
8. gcr ◴[] No.42158372[source]
How accurate is this?

When compared with existing OCR systems, what sorts of mistakes does it make?

9. rch ◴[] No.42159434[source]
I've had trouble with pulling scientific content out of poster PDFs, mostly because e.g. nougat falls apart with different layouts.

Have you considered that usage yet?

10. timmattison ◴[] No.42163822{4}[source]
I love this. Can you share the source?