(github.com)

990 points pierre | 1 comments | 20 Oct 25 06:26 UTC


edtechdev [20 Oct 25 15:39 UTC] No.45645157

I tried this out on Hugging Face, and it has the same issue as every other multimodal AI OCR option (including MinerU, olmOCR, Gemini, ChatGPT, ...). It ignores pictures, charts, and other visual elements in a document, even though the models are pretty good at describing images and charts by themselves. What this means is that you can't use these tools yet to create fully accessible alternatives to PDFs.

replies(1): >>45645515 #

1. mediaman [20 Oct 25 16:08 UTC] No.45645515

>>45645157 #

I have a lot of success asking models such as Gemini to OCR the text, and then to describe any images on the document, including charts. I have it format the sections with XML-ish tags. This also works for tables.
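A rough sketch of what that workflow could look like, in Python. The prompt wording and the tag names (`text`, `figure`, `table`) are assumptions made up for illustration, not anything the comment specifies, and the model call itself is left out; the code only shows how XML-ish tagged output might be split back into sections.

```python
import re

# Hypothetical prompt; the tag names are illustrative assumptions,
# not a format required by any particular model.
PROMPT = (
    "Transcribe all text on this page inside <text>...</text> tags. "
    "Describe each picture or chart inside <figure>...</figure> tags. "
    "Reproduce each table inside <table>...</table> tags. "
    "Keep the sections in reading order."
)

# Matches a tagged section and its matching close tag (non-nested).
SECTION_RE = re.compile(r"<(text|figure|table)>(.*?)</\1>", re.DOTALL)


def parse_sections(response: str) -> list[tuple[str, str]]:
    """Return (tag, content) pairs in the order they appear in the response."""
    return [
        (match.group(1), match.group(2).strip())
        for match in SECTION_RE.finditer(response)
    ]


# Made-up model response, used only to exercise the parser.
sample = """
<text>Quarterly revenue grew in every region.</text>
<figure>Bar chart of revenue by region; the tallest bar is EMEA.</figure>
<table>Region | Revenue
EMEA | 12
APAC | 9</table>
"""

sections = parse_sections(sample)
for tag, content in sections:
    print(f"[{tag}] {content}")
```

A regex is enough here because the output is only XML-ish: a strict XML parser would reject stray `<` or `&` characters that a model may emit inside transcribed text.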


DeepSeek OCR