
170 points ses425500000 | 1 comments | 05 Apr 25 05:22 UTC

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
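To give a rough idea of what "multi-stage" means here, this is a minimal sketch of routing detected regions to different backends and emitting both JSON and Markdown. It is not the code from the repository: `Region`, `HANDLERS`, `run_pipeline`, and `to_markdown` are hypothetical names, and the handler functions are local stubs standing in for the layout detector, OCR, and vision-model calls (no external service is contacted).

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    kind: str     # "text", "math", "table", or "figure" (assumed label set)
    content: str  # stand-in for the cropped image of the region


# Hypothetical stubs: a real pipeline would call an OCR or vision backend here.
def ocr_text(region: Region) -> str:
    return region.content


def ocr_math(region: Region) -> str:
    # Wrap recognized math in inline LaTeX delimiters.
    return f"${region.content}$"


def describe_visual(region: Region) -> str:
    # Tables and figures get a short bracketed description.
    return f"[{region.kind}: {region.content}]"


# Stage 2: each region type is routed to the backend suited to it.
HANDLERS: Dict[str, Callable[[Region], str]] = {
    "text": ocr_text,
    "math": ocr_math,
    "table": describe_visual,
    "figure": describe_visual,
}


def run_pipeline(regions: List[Region]) -> List[dict]:
    """Process regions in reading order and return structured records."""
    results = []
    for order, region in enumerate(regions):
        handler = HANDLERS.get(region.kind, ocr_text)  # unknown kinds fall back to text OCR
        results.append({"order": order, "kind": region.kind, "output": handler(region)})
    return results


def to_markdown(results: List[dict]) -> str:
    """Join the per-region outputs into one Markdown document."""
    return "\n\n".join(item["output"] for item in results)


# Stage 1 (layout detection) is assumed to have produced these regions.
page = [
    Region("text", "問1 次の式を解け。"),
    Region("math", r"\frac{a}{b} = 2"),
    Region("figure", "cell diagram"),
]
results = run_pipeline(page)
as_json = json.dumps(results, ensure_ascii=False, indent=2)
as_markdown = to_markdown(results)
```

Keeping the routing in a plain dictionary makes it easy to swap one backend for another without touching the rest of the flow.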

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). I would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

aghilmort [05 Apr 25 21:25 UTC] No.43596947

>>43590998 (OP) #

Super great work. Do you convert math formulas to LaTeX, and/or how are those and other symbolic (not necessarily Unicode) characters handled?

replies(1): >>43598223 #

1. ses425500000 [06 Apr 25 01:27 UTC] No.43598223

>>43596947 #

Thanks a lot! Yeah, theoretically the pipeline handles math and special symbols fine, and from my testing it worked well. But I didn’t test much on other languages or encodings, so if there’s any weird behavior, please let me know and I’ll check it!
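For anyone who wants to check encoding behavior on their own data, here is a small sketch, assuming the output is stored as JSON. The `record` contents are made-up sample values, not output from the pipeline. It verifies that LaTeX backslashes and Japanese/Korean text survive a JSON round trip, and that Unicode normalization maps decomposed and precomposed forms of the same string to one representation.

```python
import json
import unicodedata

# Made-up sample record mixing LaTeX and multilingual text.
record = {
    "latex": r"\int_0^1 x^2 \, dx = \frac{1}{3}",
    "ja": "細胞膜",
    "ko": "세포막",
}

# ensure_ascii=False keeps non-ASCII characters readable instead of \uXXXX escapes.
encoded = json.dumps(record, ensure_ascii=False)
decoded = json.loads(encoded)
round_trip_ok = decoded == record

# The same Korean text can arrive decomposed (NFD) or precomposed (NFC);
# normalizing both to NFC makes them compare equal.
decomposed = unicodedata.normalize("NFD", record["ko"])
same_after_nfc = (
    unicodedata.normalize("NFC", decomposed)
    == unicodedata.normalize("NFC", record["ko"])
)
```

If `round_trip_ok` or `same_after_nfc` comes out false on real output, that points to an encoding problem somewhere upstream of the JSON step.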

Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)