I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.
Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.
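Very roughly, the stages chain together like this. This is a pseudocode-level sketch only; the wrapper callables are placeholders for the real engine code, not actual APIs:

```python
# Rough sketch of the stage ordering only; the engine wrappers
# (layout_model, vision_ocr, mathpix_ocr, gemini_describe) are
# placeholders supplied by the caller, not real library calls.
from typing import Any, Callable, Dict, List

def process_page(
    image_bytes: bytes,
    layout_model: Callable[[bytes], List[Dict[str, Any]]],   # DocLayout-YOLO wrapper
    vision_ocr: Callable[[bytes], str],                       # Google Vision wrapper
    mathpix_ocr: Callable[[bytes], str],                      # MathPix wrapper (LaTeX out)
    gemini_describe: Callable[[bytes, str], Dict[str, Any]],  # Gemini Pro Vision wrapper
) -> Dict[str, Any]:
    """Turn one page image into a structured record for dataset building."""
    page: Dict[str, Any] = {"text": [], "equations": [], "figures": [], "tables": []}

    # 1. Layout detection: split the page into typed regions.
    for region in layout_model(image_bytes):
        crop = region["crop"]  # bytes of the cropped region

        if region["kind"] == "formula":
            # 2a. Math regions go to MathPix and come back as LaTeX.
            page["equations"].append(mathpix_ocr(crop))
        elif region["kind"] in ("figure", "table"):
            # 2b. Figures/tables get a structured, context-aware description
            #     from Gemini instead of raw OCR text.
            key = "figures" if region["kind"] == "figure" else "tables"
            page[key].append(gemini_describe(crop, " ".join(page["text"])))
        else:
            # 2c. Plain multilingual text (ja/ko/en) goes to Google Vision.
            page["text"].append(vision_ocr(crop))

    return page  # later serialized to JSON / Markdown
```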
I love the double prompting to keep GPT from translating the text. I've definitely had this problem before, and spent ages trying to prompt it into not doing that.
If it still misbehaves in any edge cases, feel free to open an issue on GitHub — happy to patch it up.
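For context, the double prompting is conceptually just a transcription pass followed by a verification pass that forces the model to undo any accidental translation. A simplified sketch of that idea (the real prompts are longer, and `call_model` stands in for the actual model wrapper):

```python
# Simplified sketch of the two-pass idea; `call_model` is a stand-in for
# whatever vision-model wrapper actually runs the text pass.
from typing import Any, Callable, List

TRANSCRIBE_PROMPT = (
    "Transcribe the text in this image exactly as written. "
    "Keep the original language (Japanese/Korean/English). Do NOT translate."
)

VERIFY_PROMPT = (
    "Here is a transcription of the same image:\n\n{draft}\n\n"
    "If any part was translated or paraphrased, re-output the text verbatim "
    "in its original language. Otherwise return the transcription unchanged."
)

def ocr_without_translation(image: Any, call_model: Callable[[List[Any]], str]) -> str:
    # Pass 1: ask for a verbatim transcription in the source language.
    draft = call_model([TRANSCRIBE_PROMPT, image])
    # Pass 2: show the model its own output next to the image and have it
    # undo any accidental translation before the result is accepted.
    return call_model([VERIFY_PROMPT.format(draft=draft), image])
```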
In addition, for figures and diagrams, I use Gemini Pro Vision not just to extract the content, but to generate context-aware, structured descriptions that are better suited as ML training input — rather than just dumping raw image text.
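As a simplified illustration of that figure step, the call looks roughly like this with the google-generativeai SDK (the prompt and JSON schema here are abbreviated, not the exact ones the pipeline sends):

```python
# Simplified version of the figure step, using the google-generativeai SDK.
# The prompt and output schema are illustrative only.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")

FIGURE_PROMPT = (
    "Describe this figure for use as ML training data. Return JSON with keys: "
    "type (diagram/chart/table/photo), labels, and a 2-3 sentence description "
    "of what the figure shows, grounded in this surrounding text:\n\n{context}"
)

def describe_figure(image_path: str, context: str) -> str:
    """Ask Gemini Pro Vision for a structured, context-aware figure description."""
    image = Image.open(image_path)
    response = model.generate_content([FIGURE_PROMPT.format(context=context), image])
    return response.text  # JSON string merged into the page record downstream
```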
So in short, generative AI is used here more as a smart post-processing layer to enhance the usability and semantic clarity of the OCR outputs.