
170 points by ses425500000 | 4 comments

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:

• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
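
To make the flow concrete, here is a minimal sketch of how stages like these are typically wired together. It is not the repo's actual code: the `Region` type and the callables passed in (`ocr_text`, `ocr_math`, `describe_figure`, `refine`) are placeholders standing in for the Google Vision, MathPix, Gemini, and LLM-cleanup wrappers.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Region:
    """Placeholder for one layout-detection result (e.g. a DocLayout-YOLO box)."""
    kind: str    # "text", "table", "formula", or "figure"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    crop: Any    # cropped image for this region

def process_page(
    regions: List[Region],
    ocr_text: Callable[[Any], str],         # e.g. a Google Vision wrapper
    ocr_math: Callable[[Any], str],         # e.g. a MathPix wrapper
    describe_figure: Callable[[Any], str],  # e.g. a Gemini Pro Vision wrapper
    refine: Callable[[List[Dict]], List[Dict]],  # second-stage LLM cleanup
) -> List[Dict]:
    """Route each detected region to the right engine, then run LLM cleanup."""
    results = []
    for region in regions:
        if region.kind == "formula":
            content = ocr_math(region.crop)
        elif region.kind == "figure":
            content = describe_figure(region.crop)
        else:  # plain text, tables, etc.
            content = ocr_text(region.crop)
        results.append({"type": region.kind, "bbox": region.bbox, "content": content})
    return refine(results)
```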

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

1. themanmaran
> Never change the original language of any text. Keep Korean in Korean, Japanese in Japanese, and English in English.

I love the double prompting to keep GPT from translating the text. I've definitely had this problem before, and spent ages trying to prompt it into not randomly translating the text.
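
For anyone curious, here is a minimal sketch of what that kind of double prompting can look like, assuming the OpenAI Python client; the model name and prompt wording are illustrative, not the project's actual prompt. The rule is stated once in the system message and repeated right next to the text being processed:

```python
from openai import OpenAI

client = OpenAI()

NO_TRANSLATE_RULE = (
    "Never change the original language of any text. "
    "Keep Korean in Korean, Japanese in Japanese, and English in English."
)

def clean_ocr_text(raw_text: str) -> str:
    # The rule appears twice: once as a system instruction, once inline with
    # the input, which is the "double prompting" described above.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You clean up noisy OCR output. " + NO_TRANSLATE_RULE},
            {"role": "user",
             "content": f"Clean up this OCR output. {NO_TRANSLATE_RULE}\n\n{raw_text}"},
        ],
    )
    return response.choices[0].message.content
```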

2. ses425500000
Yeah — I ran into that exact problem during early testing. The prompt has since been adjusted to prevent GPT from auto-translating non-English text (Korean, Japanese, etc.).

If it still misbehaves in any edge cases, feel free to open an issue on GitHub — happy to patch it up.

3. fmbb
What’s the point of using generative AI to OCR the text?
4. ses425500000
Great question. I’m using traditional OCR engines for the initial text extraction (e.g., MathPix and Google Vision), then applying generative AI models in a second stage to refine the output. This includes removing noisy or irrelevant elements, normalizing formatting inconsistencies, and improving alignment across multimodal inputs.

In addition, for figures and diagrams, I use Gemini Pro Vision not just to extract the content, but to generate context-aware, structured descriptions that are better suited as ML training input — rather than just dumping raw image text.
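
As a rough illustration of that figure-description step, here is a sketch assuming the google-generativeai Python package; the prompt text is mine, not the project's:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")

def describe_figure(path: str) -> str:
    """Turn a cropped figure or diagram into a structured, training-ready description."""
    image = Image.open(path)
    prompt = (
        "Describe this diagram for use as ML training data: state what it depicts, "
        "name its components or axes, and list the key relationships. "
        "Do not translate any text that appears in the image."
    )
    response = model.generate_content([prompt, image])
    return response.text
```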

So in short, generative AI is used here more as a smart post-processing layer to enhance the usability and semantic clarity of the OCR outputs.