170 points | ses425500000 | 2 comments | 05 Apr 25 05:22 UTC

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
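To make the multi-stage idea concrete, here is a minimal sketch of how such a pipeline might route page regions to specialized recognizers and emit JSON or Markdown. It is an illustration only, not code from the repository: `Region`, `Block`, `run_pipeline`, and every `recognize_*`/`describe_figure` function are hypothetical names, and the recognizers are plain-string stand-ins for where a layout detector and external OCR services would be called.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


@dataclass
class Region:
    """A detected page region. In a real pipeline a layout model would
    produce these; here `payload` is a plain string standing in for a crop."""
    kind: str      # assumed labels: "text", "math", "table", "figure"
    payload: str


@dataclass
class Block:
    """One recognized unit of output."""
    kind: str
    content: str


# Hypothetical stand-ins for the per-type recognizers. A real pipeline
# would call a text OCR, a math OCR, or a vision-language model here.
def recognize_text(region: Region) -> str:
    return region.payload


def recognize_math(region: Region) -> str:
    return f"$${region.payload}$$"


def recognize_table(region: Region) -> str:
    return region.payload


def describe_figure(region: Region) -> str:
    return f"[Figure: {region.payload}]"


HANDLERS: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "math": recognize_math,
    "table": recognize_table,
    "figure": describe_figure,
}


def run_pipeline(regions: List[Region]) -> List[Block]:
    """Route each region to the handler for its kind, preserving order.
    Unknown kinds fall back to plain text recognition."""
    blocks = []
    for region in regions:
        handler = HANDLERS.get(region.kind, recognize_text)
        blocks.append(Block(region.kind, handler(region)))
    return blocks


def to_json(blocks: List[Block]) -> str:
    """Serialize blocks as a JSON array; keep non-ASCII text readable."""
    return json.dumps([asdict(b) for b in blocks], ensure_ascii=False, indent=2)


def to_markdown(blocks: List[Block]) -> str:
    """Join block contents with blank lines between them."""
    return "\n\n".join(b.content for b in blocks)
```

Keeping the dispatch table separate from the serializers means a recognizer can be swapped without touching the output formats, which is one plausible way to get the modularity a multi-engine setup needs.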

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). I'd love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

1. liangzhe88 [05 Apr 25 10:56 UTC] No.43592408

>>43590998 (OP) #

Curious if there are plans to update this. Seems interesting.

replies(1): >>43592460 #

2. ses425500000 [05 Apr 25 11:09 UTC] No.43592460

>>43592408 (TP) #

Thanks! Yes — I’m definitely planning to update and refine the project over time.

This initial release is mostly a working prototype to demonstrate the full pipeline logic, and I’ll continue improving stability, modularity, and usability. A lot more updates are in the pipeline, so stay tuned! Feel free to open issues or suggestions anytime — feedback is always welcome!

Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)