293 points lapnect | 18 comments
1. notsylver ◴[] No.42154841[source]
I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it; LLMs do far better. Gemini Flash came out on top of the models I tested, and it wasn't even close. But it still had enough failures and hallucinations that it was faster to just type the text in by hand. Annoying, considering how close it feels to working.

This seems worse. Sometimes it replies with just the text, sometimes with a full "The image is a scanned document with handwritten text...". I was hoping some fine-tuning or something would let it beat Gemini Flash; it would save me a lot of time. :(
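For context, the calls are roughly this shape (Python, google-generativeai; the model name and prompt here are placeholders rather than my exact setup):

    import PIL.Image
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

    img = PIL.Image.open("photo_back_0001.jpg")
    resp = model.generate_content(
        [img, "Transcribe the handwritten text in this image. Reply with the text only."]
    )
    print(resp.text)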

replies(7): >>42154901 #>>42155002 #>>42155087 #>>42155372 #>>42155438 #>>42156428 #>>42156646 #
2. og_kalu ◴[] No.42154901[source]
>Normal OCR models are terrible at it; LLMs do far better. Gemini Flash came out on top of the models I tested, and it wasn't even close.

As for normal models, the state of open-source OCR is pretty terrible. Unfortunately, the closed options from Microsoft, Google, etc. are much better. Did you try those?

Interesting about Flash. What LLMs did you test?

replies(2): >>42155032 #>>42156731 #
3. 8n4vidtmkvmk ◴[] No.42155002[source]
That's a bummer. I'm trying to do the exact same thing right now, digitizing family photos. Some of mine have German on the back. The last OCR model to hit the headlines was terrible, so I was hoping this one would be better. ChatGPT 4o has been good though, when I paste individual images into the chat. I haven't tried the API yet, and I'm not sure how much it would cost to process 6500 photos, many of which are blank; I don't have an easy way to filter those out either.
replies(2): >>42155142 #>>42155260 #
4. notsylver ◴[] No.42155032[source]
I tried open-source and closed-source OCR models; all were pretty bad. Google Vision was probably the best of the "OCR" models, but it liked adding spaces between characters and had other issues I've forgotten. It was bad enough that I wondered if I was using it wrong. By the time I was passing the text to an LLM along with the image so it could do "touchups" and fix the mistakes, I gave up and decided to try LLMs for the whole task.

I don't remember the exact models; I more or less just went through the OpenRouter vision model list and tried them all. Gemini Flash performed the best, somehow better than Gemini Pro. GPT-4o/mini was terrible, and expensive enough that it would have had to be near perfect to consider it. Pixtral did terribly. That's all I remember, but I tried more than just those. I think Llama 3.2 is the only one I haven't properly tried, but I don't have high hopes for it.

I think even if OCR models were perfect, they couldn't have done some of the things I was using LLMs for, like extracting structured information at the same time as the plain text: pulling any dates listed in the text into a standard ISO format was nice, as well as grabbing people's names. Being able to say "Only look at the hand-written text, ignore printed text" and have it work was incredible.
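The structured part was just more prompt; a rough sketch (not my exact wording, and the model name is a placeholder):

    import PIL.Image
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

    prompt = (
        "Transcribe only the hand-written text in this image and ignore any printed text. "
        "Return JSON with keys: 'text' (the full transcription), "
        "'dates' (any dates mentioned, as ISO 8601 strings) and "
        "'names' (any people's names mentioned)."
    )
    resp = model.generate_content([PIL.Image.open("photo_back_0001.jpg"), prompt])
    print(resp.text)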

replies(2): >>42155515 #>>42155819 #
5. philips ◴[] No.42155087[source]
Have you tried downscaling the images? I started getting better results with lower resolution images. I was using scans made with mobile phone cameras for this.

convert -density 76 input.pdf output-%d.png
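(If I remember right, -density sets the DPI ImageMagick uses when rasterising each PDF page, and the default is 72, so 76 keeps the output close to screen resolution rather than print resolution.)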

https://github.com/philips/paper-bidsheets

replies(1): >>42155225 #
6. bosie ◴[] No.42155142[source]
Use a local rubbish model to extract text. If it doesn’t find any on the back, don’t send it to ChatGPT?

Terrascan comes to mind

replies(1): >>42159947 #
7. notsylver ◴[] No.42155225[source]
That's interesting. I downscaled the images to something like 800px, but that was mostly to try to improve upload times. I wonder if downscaling further, and with a better algorithm, would help. I remember when using CLIP that different scaling algorithms affected text readability. Maybe the text is just being butchered when it's rescaled.

Though I also tried the high-detail setting, which I think should deal with most issues coming from that, and it didn't seem to help much.
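By "better algorithm" I mean something like Pillow's Lanczos filter rather than whatever the default resize path does; roughly this (the 800px target is just what I happened to use):

    from PIL import Image

    img = Image.open("scan_0001.jpg")
    img.thumbnail((800, 800), Image.Resampling.LANCZOS)  # resizes in place, keeps aspect ratio
    img.save("scan_0001_small.png")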

8. notsylver ◴[] No.42155260[source]
I found 4o to be one of the worst, but I was using the API. I didn't test it properly, but it sometimes feels like images uploaded through ChatGPT work better than ones sent through the API. I used Gemini Flash in the end; it seemed better than 4o, and images are so cheap that I have a hard time believing Google is even covering the bandwidth costs.

I also tried preprocessing images before sending them through. I tried cropping them to just the text to see if it helped, then I added filters on top to try to brighten the text; somehow all of that made it worse. The most success I had was just holding the photo in my hand and taking a picture of it; the busy background seemed to help, but I have absolutely no idea why.

The main problem was that it would work well for a few dozen images, you'd start to trust it, and then it would hallucinate, misread a crossed-out word with a correction, or miss text that had faded. I've pretty much given up on the idea. My new plan is to repurpose the website I made for verifying the results into one where you enter the text manually, along with the date, location and favourite status.

9. ◴[] No.42155372[source]
10. bboygravity ◴[] No.42155438[source]
Have you tried Claude?

It's not good at returning the locations of text (yet), but in my testing it's been insanely good at OCR.

11. dleeftink ◴[] No.42155819{3}[source]
WordNinja is pretty good as a post-processing step on wrongly split/concatenated words:

[0]: https://github.com/keredson/wordninja
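Usage is about as minimal as it gets (output from memory, so roughly):

    import wordninja

    wordninja.split("thetextwithoutanyspaces")
    # -> something like ['the', 'text', 'without', 'any', 'spaces']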

12. ◴[] No.42156428[source]
13. danvk ◴[] No.42156646[source]
I've had really good luck recently running OCR over a corpus of images using gpt-4o. The most important thing I realized was that non-fancy data prep is still important, even with fancy LLMs. Cropping my images to just the text (excluding any borders) and increasing the contrast of the image helped enormously. (I wrote about this in 2015 and this post still holds up well with GPT: https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-a...).

I also found that giving GPT at most a few paragraphs at a time worked better than giving it whole pages. Shorter text = less chance to hallucinate.
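For what it's worth, that prep is only a few lines with Pillow; a sketch (the crop box and contrast factor below are made-up numbers, you'd tune them per corpus):

    from PIL import Image, ImageEnhance

    img = Image.open("page_012.png").convert("L")   # greyscale
    img = img.crop((120, 80, 1400, 950))            # crop to just the text block
    img = ImageEnhance.Contrast(img).enhance(2.0)   # boost contrast
    img.save("page_012_prepped.png")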

replies(1): >>42156712 #
14. pbhjpbhj ◴[] No.42156712[source]
Have you tried doing a verification pass: giving gpt-4o the output of the first pass, plus the image, and asking it to correct the text (or say whether they match, or...)?

Just curious whether repetition increases accuracy, or whether it just increases the opportunities for hallucination.
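Something like this is what I have in mind (OpenAI Python SDK; "gpt-4o", the file name and the prompts are just examples, and the second call is the verification pass):

    import base64
    from openai import OpenAI

    client = OpenAI()

    def ask(image_path: str, prompt: str) -> str:
        b64 = base64.b64encode(open(image_path, "rb").read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    first = ask("page.jpg", "Transcribe the text in this image.")
    second = ask("page.jpg",
                 "Here is a proposed transcription of this image:\n\n" + first +
                 "\n\nCheck it against the image and return a corrected transcription.")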

replies(1): >>42163981 #
15. pbhjpbhj ◴[] No.42156731[source]
The OCR in OneNote is incredible IME. But I've not tested it on a wide range of fonts; all I can say is that I have abysmal handwriting and it will still find words that are almost unrecognisable.
16. 8n4vidtmkvmk ◴[] No.42159947{3}[source]
"Terrascan" is a vision model? The only hits I'm getting are for a static code analyzer.
replies(1): >>42176149 #
17. danvk ◴[] No.42163981{3}[source]
I have not, but that's a great idea!
18. bosie ◴[] No.42176149{4}[source]
Sorry, I meant "Tesseract".
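i.e. something along these lines, to skip blank backs before paying for an LLM call (the bare "any text at all" check is crude, but it's cheap and local):

    from PIL import Image
    import pytesseract

    def probably_has_text(path: str) -> bool:
        # Tesseract returns (near-)empty output for blank photo backs
        return len(pytesseract.image_to_string(Image.open(path)).strip()) > 0

    # only send the images that pass this check to the paid model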