Llama-OCR: Document to Markdown

1. nutlope ◴[16 Nov 24 07:16 UTC] No.42155007[source]▶

Hi all, I'm the author of llama-ocr. Thank you for sharing & for the kind comments! I built this earlier this week since I wanted a simple API to do OCR – it uses llama 3.2 vision (hosted on together.ai, where i work) to parse images into structured markdown. I also have it available as an npm package.

Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, ect... If anyone has any questions, feel free to send them and I'll try to respond!

replies(5): >>42155235 #>>42155376 #>>42155942 #>>42158372 #>>42159434 #

2. Curiositry ◴[16 Nov 24 08:20 UTC] No.42155235[source]▶

>>42155007 (TP) #

Option to use a local LLM?

replies(1): >>42155548 #

3. nh2 ◴[16 Nov 24 09:00 UTC] No.42155376[source]▶

>>42155007 (TP) #

I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.

Is this amount of larger transformation expected/desirable?

(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)

replies(1): >>42156858 #

4. Eisenstein ◴[16 Nov 24 09:47 UTC] No.42155548[source]▶

>>42155235 #

I made a script which does exactly the same thing but locally using koboldcpp for inference. It downloads MiniCPM-V 2.6 with image projector the first time you run it. If you want to use a different model you can, but you will want to edit the instruct template to match.

* https://github.com/jabberjabberjabber/LLMOCR

replies(1): >>42155615 #

5. nirav72 ◴[16 Nov 24 10:08 UTC] No.42155615{3}[source]▶

>>42155548 #

MiniCPM-v 2.6 is probably the best self-hosted vision model I have used so far. Not just for OCR, but also image analysis. I have it setup, so my NVR (frigate) sends couple of images upon motion alert from a driveway security camera to Ollama with minicpm-v 2.6. I’m able to get a reasonably accurate description of the vehicle that pulled into the driveway. Including describing the person that exits the vehicle and also the license plate. All sent to my phone.

replies(1): >>42163822 #

6. Szpadel ◴[16 Nov 24 11:55 UTC] No.42155942[source]▶

>>42155007 (TP) #

> Need an example image? Try ours. Great idea, I wish more services would have similar feature

7. zainia ◴[16 Nov 24 15:18 UTC] No.42156858[source]▶

>>42155376 #

Here's the prompt being used, tweaking that might help: https://github.com/Nutlope/llama-ocr/blob/main/src/index.ts#...

8. gcr ◴[16 Nov 24 19:00 UTC] No.42158372[source]▶

>>42155007 (TP) #

How accurate is this?

When compared with existing OCR systems, what sorts of mistakes does it make?

9. rch ◴[16 Nov 24 21:18 UTC] No.42159434[source]▶

>>42155007 (TP) #

I've had trouble with pulling scientific content out of poster PDFs, mostly because e.g. nougat falls apart with different layouts.

Have you considered that usage yet?

10. timmattison ◴[17 Nov 24 12:22 UTC] No.42163822{4}[source]▶

>>42155615 #

I love this. Can you share the source?