
170 points | ses425500000 | 1 comment

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
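To give a rough idea of what "multi-stage" can mean here, below is a minimal sketch of region routing: a layout stage labels each region, each label is sent to a specialised recogniser, and the results are serialised as JSON or Markdown. This is not the repository's actual code. Every name in it (`Region`, `ROUTES`, `run_pipeline`, `to_markdown`, and the stub recognisers) is hypothetical, and the stubs only stand in for real engines such as a layout detector or a math OCR service.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """One layout region, as a layout-detection stage might report it."""
    kind: str     # "text", "math", "table", or "figure"
    content: str  # placeholder for the cropped image data


# Hypothetical stand-ins for real engines; each only formats its input.
def recognize_text(region: Region) -> str:
    return region.content


def recognize_math(region: Region) -> str:
    # A real math OCR stage would return LaTeX; here we just wrap the input.
    return f"${region.content}$"


def describe_visual(region: Region) -> str:
    # A real vision-language stage would produce a description of the image.
    return f"[{region.kind}: {region.content}]"


# Routing table: region kind -> recogniser (an assumption for illustration).
ROUTES: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "math": recognize_math,
    "table": describe_visual,
    "figure": describe_visual,
}


def run_pipeline(regions: List[Region]) -> List[dict]:
    """Send each region to its recogniser and collect structured results."""
    results = []
    for region in regions:
        handler = ROUTES.get(region.kind, recognize_text)  # fall back to text
        results.append({"kind": region.kind, "output": handler(region)})
    return results


def to_markdown(results: List[dict]) -> str:
    """Join recognised outputs into a Markdown document, one block each."""
    return "\n\n".join(item["output"] for item in results)


if __name__ == "__main__":
    page = [
        Region("text", "Solve for x."),
        Region("math", "x^2 = 4"),
        Region("figure", "cell diagram"),
    ]
    results = run_pipeline(page)
    print(json.dumps(results, indent=2))
    print(to_markdown(results))
```

Keeping the routing in a plain dictionary means swapping one engine for another only touches a single entry, which matters if a component has to be replaced later.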

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

GPerson ◴[] No.43593797[source]
Did you ethically acquire permission to train on the data set?
replies(1): >>43593904 #
ses425500000 ◴[] No.43593904[source]
Yep — this project uses a pre-trained DocLayout-YOLO model released under an open license by the original authors. No additional datasets were used for training. All sample data in the repo is either synthetic, publicly available, or user-generated specifically for testing purposes. If there are any concerns about specific models or datasets, I’m happy to review them and make adjustments as needed.
replies(1): >>43595236 #
sc077y ◴[] No.43595236[source]
The DocLayout-YOLO model is under the AGPL-3.0 license, which is not permissive. You can't have your project under the MIT license and also use copyleft software.
replies(1): >>43598245 #
1. ses425500000 ◴[] No.43598245[source]
I’m sorry, I didn’t know that detail. Thank you so much for letting me know! I’ll read the AGPL-3.0 license more carefully and check whether it’s compatible with MIT. If not, I’ll fix the license or change the model. Really appreciate your help!