
170 points ses425500000 | 1 comment

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
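To make the multi-stage idea concrete, here is a rough sketch of how layout regions might be routed to different recognizers and then emitted as JSON and Markdown. This is not the repository's actual code: `Region`, `ROUTES`, `run_pipeline`, and the `recognize_*` functions are hypothetical names, and the recognizers are stubs standing in for calls to a layout detector and the external OCR services.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    """One detected layout region (hypothetical structure)."""
    kind: str                        # "text", "math", "table", or "figure"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels
    content: str = ""


# Stub recognizers: placeholders for real backends (assumption, not real APIs).
def recognize_text(region: Region) -> str:
    return f"[text {region.bbox}]"


def recognize_math(region: Region) -> str:
    return f"[math {region.bbox}]"


def describe_visual(region: Region) -> str:
    return f"[{region.kind} description {region.bbox}]"


# Map each region kind to the stage that should handle it.
ROUTES: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "math": recognize_math,
    "table": describe_visual,
    "figure": describe_visual,
}


def run_pipeline(regions: List[Region]) -> List[Region]:
    """Sort regions top-to-bottom, left-to-right, then fill in their content."""
    ordered = sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))
    for region in ordered:
        handler = ROUTES.get(region.kind, recognize_text)  # unknown kinds fall back to text
        region.content = handler(region)
    return ordered


def to_json(regions: List[Region]) -> str:
    """Serialize regions as a JSON array, keeping non-ASCII text readable."""
    return json.dumps([asdict(r) for r in regions], ensure_ascii=False, indent=2)


def to_markdown(regions: List[Region]) -> str:
    """Join regions into Markdown, wrapping math regions in $$ delimiters."""
    parts = []
    for region in regions:
        if region.kind == "math":
            parts.append(f"$$\n{region.content}\n$$")
        else:
            parts.append(region.content)
    return "\n\n".join(parts)
```

Sorting by bounding box is a simplification that only approximates reading order on single-column pages; multi-column layouts would need something smarter.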

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). I would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

aghilmort No.43596947
Super great work. Do you convert math formulas to LaTeX, and/or how are those and other symbolic characters (not necessarily Unicode) handled?
replies(1): >>43598223 #
1. ses425500000 No.43598223
Thanks a lot! Yeah, theoretically the pipeline handles math and special symbols fine, and from my testing it worked well. But I didn’t test much on other languages or encodings, so if there’s any weird behavior, please let me know and I’ll check it!