Note: the whole pipeline is not open source.
I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.
Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
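For anyone wondering what "multi-stage" can look like in practice, here is a rough Python sketch of the general idea: a layout detector splits the page into typed regions, each region is routed to a recognizer suited to its type, and the results are serialized to JSON and Markdown. This is not the actual pipeline code. `Region`, `ROUTES`, and the `recognize_*` functions are hypothetical names, the region schema is an assumption, and the recognizers are stubs that return fixed strings instead of calling any real OCR service.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    """One layout region, as a layout detector might report it (hypothetical schema)."""
    kind: str                        # e.g. "text", "formula", "table"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels
    content: str = ""


# Stub recognizers: a real pipeline would call an OCR or vision model here.
# They return fixed strings so the sketch runs without any external service.
def recognize_text(region: Region) -> str:
    return "Solve for x."


def recognize_formula(region: Region) -> str:
    return r"x^2 - 4 = 0"


def recognize_table(region: Region) -> str:
    return "| a | b |\n|---|---|\n| 1 | 2 |"


# Routing table: region type -> recognizer (hypothetical).
ROUTES: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "formula": recognize_formula,
    "table": recognize_table,
}


def process_page(regions: List[Region]) -> List[Region]:
    """Fill in each region's content using the recognizer for its type."""
    for region in regions:
        handler = ROUTES.get(region.kind, recognize_text)  # unknown types fall back to text
        region.content = handler(region)
    return regions


def to_markdown(regions: List[Region]) -> str:
    """Join regions into Markdown, wrapping formulas in display-math fences."""
    parts = []
    for region in regions:
        if region.kind == "formula":
            parts.append(f"$$\n{region.content}\n$$")
        else:
            parts.append(region.content)
    return "\n\n".join(parts)


def to_json(regions: List[Region]) -> str:
    """Serialize regions as a JSON array, keeping non-ASCII text readable."""
    return json.dumps([asdict(r) for r in regions], ensure_ascii=False, indent=2)
```

Calling `process_page` on a list of `Region` objects and passing the result to `to_markdown` or `to_json` gives the two output formats. Keeping the routing in a plain dictionary means a recognizer can be swapped without touching the rest of the code.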
Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.
A fully local version of the pipeline would include:
• Tesseract or TrOCR for general OCR
• Pix2Struct, Donut, or DocTR for document structure understanding
• OpenAI CLIP for image-text semantic alignment
• Gemma / Phi / LLaMA / Mistral for downstream reasoning tasks
The goal is to make the system fully self-hostable for offline and private use.
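One way to get there without rewriting everything is to make each stage pick its backend from a list and skip anything that needs network access when running offline. The Python sketch below only illustrates that selection logic. `Backend`, `BACKENDS`, and `select_backends` are hypothetical names, and the pairing of tools to stages is my own assumption for illustration, not a description of how the pipeline is wired.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Backend:
    name: str
    needs_network: bool  # True for hosted APIs, False for self-hosted models


# Candidate backends per stage, in order of preference.
# The stage names and pairings are assumptions for illustration only.
BACKENDS: Dict[str, List[Backend]] = {
    "general_ocr": [
        Backend("Google Vision", needs_network=True),
        Backend("Tesseract", needs_network=False),
    ],
    "reasoning": [
        Backend("Gemini Pro Vision", needs_network=True),
        Backend("Mistral", needs_network=False),
    ],
}


def select_backends(offline: bool) -> Dict[str, str]:
    """Pick the first usable backend for each stage.

    In offline mode, backends that need network access are skipped.
    """
    chosen: Dict[str, str] = {}
    for stage, candidates in BACKENDS.items():
        usable = [b for b in candidates if not (offline and b.needs_network)]
        if not usable:
            raise RuntimeError(f"no usable backend for stage {stage!r}")
        chosen[stage] = usable[0].name
    return chosen
```

With `offline=True` every stage resolves to a self-hosted model, and the function raises an error if a stage has no local option, which makes gaps in the self-hosted setup visible early.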