293 points by lapnect | 2 comments
notsylver No.42154841
I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it; LLMs do far better. Gemini Flash came out on top of the models I tested, and it wasn't even close. But it still had enough failures and hallucinations that it was faster to just type the text in by hand. Annoying, considering how close it feels to working.

This seems worse. Sometimes it replies with just the text, sometimes with a full "The image is a scanned document with handwritten text...". I was hoping some fine-tuning or similar would let it beat Gemini Flash; it would save me a lot of time. :(
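For anyone curious, the kind of call I mean looks roughly like this (a minimal sketch, assuming the google-generativeai Python SDK and an API key in GEMINI_API_KEY; the model name and prompt wording are illustrative, not exactly what I ran):

    import os
    import google.generativeai as genai
    from PIL import Image

    # Point Gemini Flash at an image and ask for a plain transcription.
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    def transcribe(path: str) -> str:
        image = Image.open(path)
        prompt = ("Transcribe all text in this image exactly as written. "
                  "Reply with the text only, no commentary.")
        return model.generate_content([prompt, image]).text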

replies(7): >>42154901 #>>42155002 #>>42155087 #>>42155372 #>>42155438 #>>42156428 #>>42156646 #
danvk No.42156646
I've had really good luck recently running OCR over a corpus of images with gpt-4o. The most important thing I realized was that non-fancy data prep still matters, even with fancy LLMs. Cropping my images to just the text (excluding any borders) and increasing the contrast helped enormously. (I wrote about this in 2015 and the post still holds up well with GPT: https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-a...).
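Something like this is all I mean by data prep (a rough sketch, assuming OpenCV for the crop and Pillow for the contrast boost; the threshold choice and enhancement factor are guesses to tune, not values from my pipeline):

    import cv2
    from PIL import Image, ImageEnhance

    def prep(path: str, out: str) -> None:
        # Grayscale, then Otsu-threshold so ink pixels become white.
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, binary = cv2.threshold(img, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # Crude "crop to just the text": bounding box of all ink pixels.
        x, y, w, h = cv2.boundingRect(cv2.findNonZero(binary))
        cropped = img[y:y + h, x:x + w]
        # Boost contrast before handing the crop to the model.
        pil = Image.fromarray(cropped)
        ImageEnhance.Contrast(pil).enhance(2.0).save(out)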

I also found that giving GPT at most a few paragraphs at a time worked better than giving it whole pages. Shorter text = less chance to hallucinate.
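The per-chunk call is nothing special; roughly this, assuming the openai Python SDK with one pre-cropped paragraph image per request (the prompt and names here are illustrative):

    import base64
    from openai import OpenAI

    client = OpenAI()

    def ocr_chunk(path: str) -> str:
        # One small crop (a paragraph or two) per request.
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Transcribe the text in this image. "
                             "Reply with the text only."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content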

replies(1): >>42156712 #
pbhjpbhj No.42156712
Have you tried a verification pass: giving gpt-4o the output of the first pass, plus the image, and asking it to correct the text (or say whether they match, or...)?

Just curious whether repetition increases accuracy, or if it just increases the opportunities for hallucinations.
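Something like this is what I have in mind (an untested sketch, same openai SDK as above; the prompt is just one guess at how to phrase the check):

    import base64
    from openai import OpenAI

    client = OpenAI()

    def verify(image_path: str, first_pass: str) -> str:
        # Second pass: show the model the first-pass transcription next
        # to the original image and ask for corrections.
        b64 = base64.b64encode(open(image_path, "rb").read()).decode()
        prompt = ("Below is a transcription of the attached image. Compare "
                  "it to the image and reply with a corrected transcription "
                  "only.\n\n" + first_pass)
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content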

replies(1): >>42163981 #
danvk No.42163981
I have not, but that's a great idea!