
170 points ses425500000 | 2 comments

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:

• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision

• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)

• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
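To make the routing idea concrete, here is a rough, hypothetical sketch of how a multi-stage pipeline like this can be wired up: a layout detector splits the page into typed regions, each region goes to a specialized recognizer, and the results are serialized as JSON. The detector and recognizer functions below are placeholder stubs for illustration, not this repo's actual API.

    # Hypothetical sketch of the multi-stage routing idea; detect_layout and
    # recognize are placeholder stubs, not this repo's real functions.
    import json
    from dataclasses import dataclass, asdict
    from typing import List

    @dataclass
    class Region:
        kind: str              # "text", "math", "table", or "figure"
        bbox: List[int]        # [x1, y1, x2, y2] pixel coordinates
        content: str = ""      # recognized text / LaTeX / table / caption

    def detect_layout(image_path: str) -> List[Region]:
        # Stand-in for a layout detector such as DocLayout-YOLO.
        return [Region("text", [0, 0, 800, 200]),
                Region("math", [0, 220, 800, 320])]

    def recognize(region: Region, image_path: str) -> str:
        # Stand-in dispatch: math regions go to a math OCR engine,
        # everything else to a general OCR / vision model.
        if region.kind == "math":
            return r"\int_0^1 x^2\,dx = \tfrac{1}{3}"   # dummy LaTeX
        return "recognized text"                          # dummy output

    def ocr_page(image_path: str) -> str:
        regions = detect_layout(image_path)
        for r in regions:
            r.content = recognize(r, image_path)
        return json.dumps([asdict(r) for r in regions],
                          ensure_ascii=False, indent=2)

    print(ocr_page("sample_page.png"))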

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

1. novaRom No.43591697
> Built With: DocLayout-YOLO, Google Vision API, Gemini Pro Vision, MathPix OCR, OpenAI API, OpenCV, and more.

the whole pipeline is not open source

2. ses425500000 No.43591823
Yep — some components currently rely on external APIs (e.g. OpenAI, MathPix), primarily for stability and ease of deployment during early release. But I’m planning to support fully local inference in the future to eliminate API key dependency.

The local pipeline would include:

• Tesseract or TrOCR for general OCR

• Pix2Struct, Donut, or DocTR for document structure understanding

• OpenAI CLIP for image-text semantic alignment

• Gemma / Phi / LLaMA / Mistral for downstream reasoning tasks

The goal is to make the system fully self-hostable for offline and private use.
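For reference, a minimal sketch of what the local OCR and image-text alignment steps could look like, assuming Hugging Face transformers checkpoints for TrOCR and CLIP; the model names and image paths are illustrative, not the project's actual configuration:

    # Minimal local-inference sketch (assumption: Hugging Face transformers
    # as the backend; checkpoints and file names are illustrative only).
    from PIL import Image
    from transformers import (TrOCRProcessor, VisionEncoderDecoderModel,
                              CLIPProcessor, CLIPModel)

    # General OCR on a cropped text-line image with TrOCR -- no external API.
    ocr_processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
    ocr_model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

    line = Image.open("region_text_line.png").convert("RGB")
    pixel_values = ocr_processor(images=line, return_tensors="pt").pixel_values
    ids = ocr_model.generate(pixel_values)
    text = ocr_processor.batch_decode(ids, skip_special_tokens=True)[0]

    # CLIP scores how well the recognized text matches a figure crop, which
    # can be used to filter misaligned image-text pairs before training.
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

    figure = Image.open("region_figure.png").convert("RGB")
    inputs = clip_processor(text=[text], images=figure,
                            return_tensors="pt", padding=True)
    score = clip_model(**inputs).logits_per_image.item()  # higher = better match
    print(text, score)

Swapping TrOCR for Tesseract (pytesseract) or DocTR would keep the same region-level interface, so the routing logic wouldn't need to change.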