
293 points lapnect | 2 comments
notsylver ◴[] No.42154841[source]
I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it; LLMs do far better. Gemini Flash came out on top of the models I tested, and it wasn't even close. It still had enough failures and hallucinations that it was faster to type the text in by hand. Annoying, considering how close it feels to working.

This seems worse. Sometimes it replies with just the text, sometimes with a full "The image is a scanned document with handwritten text...". I was hoping some fine-tuning or something would let it beat Gemini Flash; it would save me a lot of time. :(

replies(7): >>42154901 #>>42155002 #>>42155087 #>>42155372 #>>42155438 #>>42156428 #>>42156646 #
1. philips ◴[] No.42155087[source]
Have you tried downscaling the images? I started getting better results with lower-resolution images. I was using scans made with mobile phone cameras for this.

convert -density 76 input.pdf output-%d.png

https://github.com/philips/paper-bidsheets
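
For anyone wondering what `-density 76` means in pixels: ImageMagick's `-density` sets the DPI used when rasterizing the PDF, and PDF page sizes are given in points (1/72 inch). Here is a rough Python sketch of the arithmetic. The function name `raster_size` and the rounding are my assumptions, so ImageMagick's actual output may differ by a pixel.

```python
# Hypothetical helper (not part of ImageMagick): estimate the pixel size
# of a PDF page rasterized at a given density (DPI).
POINTS_PER_INCH = 72  # PDF user-space units per inch


def raster_size(width_pt, height_pt, dpi):
    """Return the approximate (width_px, height_px) of a page at `dpi`."""
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return width_px, height_px


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(raster_size(612, 792, 76))   # low-density render, as in the command above
print(raster_size(612, 792, 300))  # a typical "scan quality" density, for comparison
```

So a Letter page at density 76 comes out at about 646 x 836 pixels, much smaller than a 300 DPI render of the same page.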

replies(1): >>42155225 #
2. notsylver ◴[] No.42155225[source]
That's interesting. I downscaled the images to something like 800px, but that was mostly to try to improve upload times. I wonder if downscaling further, and with a better algorithm, would help. I remember using CLIP and finding that different scaling algorithms helped text readability. Maybe the text is just being butchered when it's rescaled.

Though I also tried the high detail setting, which I think would deal with most issues that come from that, and it didn't seem to help much.
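
One way to test the scaling-algorithm idea is to produce the same downscaled image with several resampling filters and send each to the model. This is a minimal sketch, assuming Pillow 9.1 or newer is installed. The names `fit_within` and `downscale_variants` and the 800px default are illustrative, not anything used in the thread.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) so the longer side is at most max_side.

    Keeps the aspect ratio and never upscales.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def downscale_variants(path, max_side=800):
    """Return {filter_name: resized_image} for a few Pillow resampling filters."""
    from PIL import Image  # third-party: pip install Pillow (assumed >= 9.1)

    filters = {
        "nearest": Image.Resampling.NEAREST,
        "bilinear": Image.Resampling.BILINEAR,
        "bicubic": Image.Resampling.BICUBIC,
        "lanczos": Image.Resampling.LANCZOS,
    }
    with Image.open(path) as img:
        size = fit_within(img.width, img.height, max_side)
        return {name: img.resize(size, resample=f) for name, f in filters.items()}


# Example: a 4000 x 3000 phone photo limited to 800px on the long side.
print(fit_within(4000, 3000, 800))
```

Saving each variant and comparing the OCR output side by side would show whether the filter matters for a given set of photos.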