
DeepSeek OCR

(github.com)
990 points by pierre | 2 comments
1. edtechdev (comment 45645157)
I tried this out on Hugging Face, and it has the same issue as every other multimodal AI OCR option (including MinerU, olmOCR, Gemini, ChatGPT, ...). It ignores pictures, charts, and other visual elements in a document, even though the models are pretty good at describing images and charts on their own. This means you can't yet use these tools to create fully accessible alternatives to PDFs.
2. mediaman (comment 45645515, replying to edtechdev)
I have a lot of success asking models such as Gemini to OCR the text, and then to describe any images on the document, including charts. I have it format the sections with XML-ish tags. This also works for tables.
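For anyone wanting to try this, here's a rough sketch of the post-processing side of that approach. It assumes you've prompted the model to wrap each section in `<text>`, `<image>`, or `<table>` tags. Those tag names and the prompt wording are my own guesses (the comment doesn't say which tags are used), and the code doesn't call any model API. It only parses a reply that is already in hand and turns it into plain text with the figure descriptions kept inline.

```python
import re
from typing import List, Tuple

# Hypothetical prompt; the actual wording used in the comment above isn't given.
PROMPT = (
    "OCR all text on this page. For every picture or chart, write a description. "
    "Wrap each section in a tag: <text>...</text>, <image>...</image>, or <table>...</table>."
)

# Matches <tag>body</tag> for the three assumed tag names.
# The backreference \1 requires the closing tag to match the opening one;
# DOTALL lets a section span multiple lines.
SECTION_RE = re.compile(r"<(text|image|table)>(.*?)</\1>", re.DOTALL)


def parse_sections(model_output: str) -> List[Tuple[str, str]]:
    """Split XML-ish model output into (kind, content) pairs in document order."""
    return [(m.group(1), m.group(2).strip()) for m in SECTION_RE.finditer(model_output)]


def to_accessible_text(sections: List[Tuple[str, str]]) -> str:
    """Join sections into plain text, labeling figure descriptions and tables."""
    parts = []
    for kind, content in sections:
        if kind == "image":
            parts.append(f"[Figure description] {content}")
        elif kind == "table":
            parts.append(f"[Table]\n{content}")
        else:
            parts.append(content)
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Made-up model reply, for illustration only.
    sample = """<text>Quarterly revenue grew in every region.</text>
<image>Bar chart of revenue by region; North is the largest bar.</image>
<table>Region | Revenue
North | 40
South | 25</table>"""
    sections = parse_sections(sample)
    print([kind for kind, _ in sections])  # ['text', 'image', 'table']
    print(to_accessible_text(sections))
```

A regex is enough here because the tags are flat and not nested. If you ask the model for nested structure, a real XML parser (and a prompt that demands well-formed XML) would be the safer choice.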