170 points ses425500000 | 1 comment | 05 Apr 25 05:22 UTC

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
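
To make the multi-stage idea concrete, here is a rough sketch of how such a pipeline might route regions: layout detection yields typed regions, each region goes to a recognizer for its type, and the results are emitted as JSON or Markdown. This is not the repository's actual code. The `Region` schema, the function names, and the stub recognizers are all hypothetical; in a real setup the stubs would be replaced by calls to a layout model and to the OCR/vision backends.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    """One layout region on a page (hypothetical schema, not the repo's)."""
    kind: str                        # "text", "math", "table", or "figure"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    content: str = ""


def stub_recognizer(region: Region) -> str:
    # Placeholder: a real pipeline would call an OCR or vision backend here.
    return f"[{region.kind} region]"


# One recognizer per region kind; every entry is a stand-in for a real backend.
RECOGNIZERS: Dict[str, Callable[[Region], str]] = {
    "text": stub_recognizer,
    "math": stub_recognizer,
    "table": stub_recognizer,
    "figure": stub_recognizer,
}


def run_pipeline(regions: List[Region]) -> List[Region]:
    """Sort regions top-to-bottom, left-to-right, then recognize each one."""
    ordered = sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))
    for region in ordered:
        recognizer = RECOGNIZERS.get(region.kind, stub_recognizer)
        region.content = recognizer(region)
    return ordered


def to_json(regions: List[Region]) -> str:
    """Serialize regions as a JSON array, keeping non-ASCII text readable."""
    return json.dumps([asdict(r) for r in regions], ensure_ascii=False, indent=2)


def to_markdown(regions: List[Region]) -> str:
    """Join region contents as Markdown, wrapping math in $$ display blocks."""
    parts = []
    for r in regions:
        if r.kind == "math":
            parts.append(f"$$\n{r.content}\n$$")
        else:
            parts.append(r.content)
    return "\n\n".join(parts)


if __name__ == "__main__":
    page = [
        Region("math", (0, 100, 50, 120)),
        Region("text", (0, 0, 50, 20)),
    ]
    result = run_pipeline(page)
    print(to_json(result))
    print(to_markdown(result))
```

Keeping the recognizers in a dictionary keyed by region kind means a backend can be swapped without touching the routing or the output code.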

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). I would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

1. constantinum [06 Apr 25 03:01 UTC] No.43598623

>>43590998 (OP) #

For the more curious: there is also Unstract, an open-source tool for building pipelines. It lets you plug in your own AI stack, e.g. open-source LLMs, vector DBs, OCR parsers, etc.

https://github.com/Zipstack/unstract

Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)