I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.
Some features:
• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
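For context on how the stages fit together, here's a minimal Python sketch of the routing idea described above. Every name in it (function names, region types, return shapes) is an illustrative placeholder rather than the repo's actual code; the real pipeline would call the respective engines (DocLayout-YOLO, Google Vision, MathPix, Gemini) where the stubs are.

```python
# Illustrative sketch only -- all names and signatures are placeholders,
# not the project's actual API.
from dataclasses import dataclass


@dataclass
class Region:
    kind: str    # "text", "math", "table", or "figure", as decided by layout detection
    image: bytes # cropped region of the page


def detect_layout(page_image: bytes) -> list[Region]:
    """Stage 1: split the page into typed regions (a DocLayout-YOLO-style pass).
    Stubbed here; returns no regions."""
    return []


def ocr_region(region: Region) -> dict:
    """Stage 2: route each region to whichever engine handles it best.
    The engine calls are stubbed with placeholder strings."""
    if region.kind == "math":
        content = r"\frac{a}{b}"          # placeholder for a LaTeX result (MathPix-style)
    elif region.kind in ("table", "figure"):
        content = "table/figure summary"  # placeholder for a vision-LLM description
    else:
        content = "recognized text"       # placeholder for plain OCR output
    return {"type": region.kind, "content": content}


def process_page(page_image: bytes) -> list[dict]:
    """Stage 3: assemble JSON-ready records for the whole page."""
    return [ocr_region(r) for r in detect_layout(page_image)]


if __name__ == "__main__":
    print(process_page(b""))  # prints [] with the stubbed layout detector
```

The gist is just: layout detection decides each region's type, each type goes to the engine suited to it, and everything gets normalized into structured records (JSON/Markdown) at the end.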
Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.
Did you ethically acquire permission to train on the data set?
Yep — this project uses a pre-trained DocLayout-YOLO model released under an open license by the original authors. No additional datasets were used for training. All sample data in the repo is either synthetic, publicly available, or user-generated specifically for testing purposes. If there are any concerns about specific models or datasets, I’m happy to review them and make adjustments as needed.
I'm sorry I didn't know that detail, and thank you so much for letting me know! I'll read the AGPL-3.0 license more carefully and check whether it's compatible with MIT. If not, I'll fix the licensing or switch to a different model. Really appreciate your help!