293 points lapnect | 5 comments
notsylver ◴[] No.42154841[source]
I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it; LLMs do far better. Gemini Flash came out on top of the models I tested, and it wasn't even close. It still had enough failures and hallucinations that it was faster to type the text in by hand. Annoying, considering how close it feels to working.

This seems worse. Sometimes it replies with just the text; sometimes it replies with a full "The image is a scanned document with handwritten text...". I was hoping for some fine-tuning or something that would let it beat Gemini Flash; it would save me a lot of time. :(

replies(7): >>42154901 #>>42155002 #>>42155087 #>>42155372 #>>42155438 #>>42156428 #>>42156646 #
1. 8n4vidtmkvmk ◴[] No.42155002[source]
That's a bummer. I'm trying to do the exact same thing right now: digitizing family photos. Some of mine have German on the back. The last OCR model to hit the headlines was terrible, so I was hoping this one would be better. ChatGPT 4o has been good, though, when I paste individual images into the chat. I haven't tried the API yet; I'm not sure how much it would cost to process 6500 photos, many of which are blank, but I don't have an easy way to filter those out either.
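
For anyone curious what the API route looks like, here is a minimal sketch against the OpenAI Python SDK (the file name, prompt, and model choice are placeholders; check current per-image pricing before running anything like this over thousands of photos):

    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # hypothetical scan of a photo's back
    with open("photo_back.jpg", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe any handwritten text in this image. Reply with the text only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)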
replies(2): >>42155142 #>>42155260 #
2. bosie ◴[] No.42155142[source]
Use a local rubbish model to extract text. If it doesn't find any on the back, don't send the image to ChatGPT?

Terrascan comes to mind

replies(1): >>42159947 #
3. notsylver ◴[] No.42155260[source]
I found 4o to be one of the worst, but I was using the API. I didn't test it rigorously, but it sometimes feels like images uploaded through ChatGPT work better than ones sent through the API. I ended up using Gemini Flash; it seemed better than 4o, and images are so cheap to process that I have a hard time believing Google is making any money, even on bandwidth costs.
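
A comparable sketch with Google's google-generativeai SDK (the model name is an assumption, and the prompt and path are placeholders):

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="...")  # your Gemini API key
    model = genai.GenerativeModel("gemini-1.5-flash")

    resp = model.generate_content([
        "Transcribe the handwritten text in this image. Reply with the text only.",
        Image.open("photo_back.jpg"),  # hypothetical scan
    ])
    print(resp.text)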

I also tried preprocessing images before sending them through. First I cropped each image to just the text to see if that helped. Then I tried applying filters on top to brighten the text; somehow all of that made results worse. The most success I had was holding the photo in my hand and taking a picture of it; the busy background seemed to help, but I have absolutely no idea why.
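
The preprocessing attempts described above would look roughly like this with Pillow (the enhancement factors are guesses and, per the above, this kind of thing can easily make results worse):

    from PIL import Image, ImageEnhance, ImageOps

    img = Image.open("photo_back.jpg")             # hypothetical input
    img = ImageOps.grayscale(img)                  # drop colour noise
    img = ImageEnhance.Contrast(img).enhance(2.0)  # separate faint ink from paper
    img = ImageEnhance.Brightness(img).enhance(1.2)
    img.save("photo_back_processed.jpg")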

The main problem was that it would work well for a few dozen images, you'd start to trust it, and then it would hallucinate, misread a crossed-out word with a correction, or miss text that had faded. I've pretty much given up on the idea. My new plan is to repurpose the website I made for verifying the results into one where you enter the text manually, along with the date, location, and favourite status.

4. 8n4vidtmkvmk ◴[] No.42159947[source]
"Terrascan" is a vision model? The only hits I'm getting are for a static code analyzer.
replies(1): >>42176149 #
5. bosie ◴[] No.42176149{3}[source]
Sorry, I meant "Tesseract".
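
A minimal sketch of that blank-filtering idea with Tesseract via pytesseract (the folder and threshold are made up; Tesseract's recall on faded handwriting is poor, so keep the threshold low and treat this only as a cheap pre-filter):

    import glob
    import pytesseract
    from PIL import Image

    def looks_blank(path: str, min_chars: int = 3) -> bool:
        # Cheap local pass: if Tesseract finds (almost) no text, skip the upload.
        # Add lang="deu" if the German language pack is installed.
        text = pytesseract.image_to_string(Image.open(path))
        return len(text.strip()) < min_chars

    photo_paths = glob.glob("scans/*.jpg")  # hypothetical folder of photo backs
    to_upload = [p for p in photo_paths if not looks_blank(p)]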