
170 points | ses425500000 | 2 comments

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:

• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision

• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)

• Tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
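
Roughly, the layout detector decides which engine handles each detected region. Here is a simplified sketch of that routing idea — the names (route_regions, REGION_ROUTES) are illustrative only, not the actual code in the repo:

    # Minimal sketch (illustrative names, not the repo's API): route each
    # detected layout region to the OCR engine suited to its content type.
    REGION_ROUTES = {
        "formula": "mathpix",        # math -> LaTeX
        "table":   "mathpix",        # tables benefit from math-aware OCR
        "figure":  "gemini_vision",  # diagrams -> natural-language description
        "text":    "google_vision",  # plain multilingual text
    }

    def route_regions(regions):
        """regions: list of {'type': str, 'bbox': [x1, y1, x2, y2]} dicts,
        as produced by a layout detector such as DocLayout-YOLO."""
        for r in regions:
            yield {**r, "engine": REGION_ROUTES.get(r["type"], "google_vision")}

    # Example page: one formula region and one text region
    page = [{"type": "formula", "bbox": [10, 20, 200, 60]},
            {"type": "text",    "bbox": [10, 80, 400, 300]}]
    print(list(route_regions(page)))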

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

1. sandreas | No.43592384
How does this compare against marker[1]?

1: https://github.com/VikParuchuri/marker

2. ses425500000 | No.43592452
Thanks for sharing — Marker is a great tool, especially for human-readable formatting!

In contrast, this project focuses less on preserving the visual layout for human readers, and more on extracting structured semantic data for machine learning training.

So instead of optimizing for clean Markdown or HTML, it extracts context-aware elements like:

• table data as JSON,

• math expressions in LaTeX,

• diagrams with image descriptions,

• multilingual text segments,

• and semantic roles (e.g. “question”, “explanation”, etc.)
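
To make that concrete, a single extracted record might look something like this (field names are illustrative, not the exact output schema):

    # Illustrative only: a guessed record shape, not the project's exact schema.
    import json

    record = {
        "role": "question",                      # semantic role
        "lang": "ja",                            # detected language
        "text": "次の反応の平衡定数を求めよ。",      # multilingual text segment
        "math": [r"K = \frac{[C][D]}{[A][B]}"],  # formulas as LaTeX
        "tables": [{"headers": ["T (K)", "K"],   # table data as JSON
                    "rows": [[298, 1.8], [350, 4.2]]}],
        "figures": [{"bbox": [50, 400, 300, 600],
                     "description": "Energy diagram showing activation energy"}],
    }
    print(json.dumps(record, ensure_ascii=False, indent=2))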

In short: Marker is great for reading; this is built for feeding into ML pipelines, especially for tasks like question answering, diagram reasoning, or multimodal pretraining.
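
Downstream, those records can be flattened into fine-tuning or RAG examples, for instance like this (again, just a sketch of how the output could be consumed, not code from the repo):

    # Sketch: flatten structured records into JSONL training examples.
    import json

    records = [
        {"role": "question", "text": "Find the equilibrium constant.",
         "math": [r"K = \frac{[C][D]}{[A][B]}"]},
        {"role": "explanation", "text": "Substitute the equilibrium concentrations into K."},
    ]

    def to_example(question, explanation):
        prompt = question["text"] + "\n" + "\n".join(question.get("math", []))
        return {"prompt": prompt, "completion": explanation["text"]}

    with open("dataset.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(to_example(records[0], records[1]), ensure_ascii=False) + "\n")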
