
170 points | ses425500000 | 6 comments

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:

• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision

• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)

• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
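To give a rough idea of how a multi-stage setup like this typically fits together (the function and class names below are my own illustration, not the actual repo API): a layout detector assigns each page region a class, and each class is routed to the engine best suited for it.

```python
# Hypothetical sketch: route detected layout regions to per-modality
# OCR backends. Names are illustrative, not the project's real API.
from dataclasses import dataclass


@dataclass
class Region:
    kind: str    # "text" | "table" | "formula" | "figure"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def route_region(region: Region) -> str:
    """Pick an OCR backend for a region (illustrative mapping only)."""
    backends = {
        "text": "google_vision",    # plain multilingual text
        "formula": "mathpix",       # LaTeX-style math
        "table": "gemini_vision",   # structure-aware extraction
        "figure": "gemini_vision",  # diagram description
    }
    return backends.get(region.kind, "google_vision")


def ocr_page(regions: list[Region]) -> list[dict]:
    """Second stage: tag each region with its backend, keeping layout info."""
    return [
        {"backend": route_region(r), "kind": r.kind, "bbox": r.bbox}
        for r in regions
    ]
```

The point of the dispatch is that no single engine handles math, tables, and multilingual text equally well, so each region type goes to the strongest tool for it.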

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

jlcases No.43592934
This is a valuable contribution. The quality of ML models heavily depends on the quality of training data, and extracting structured information from unstructured documents (like PDFs) is a critical bottleneck.

A key challenge after OCR is organizing the extracted data into a coherent knowledge structure. We've seen significant improvements in downstream ML tasks when the extracted data is organized using a hierarchical, MECE (Mutually Exclusive, Collectively Exhaustive) framework. This ensures that relationships between entities (tables, diagrams, text) are explicitly captured.
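(To make that concrete, a hierarchical MECE structure could look something like the following; the schema is my own illustration, not anything from the project. Each extracted element belongs to exactly one section node, with cross-modal relationships stored explicitly.)

```python
# Illustrative schema: every element lives in exactly one section
# (mutually exclusive), and the section tree covers the whole
# document (collectively exhaustive). Not the project's real format.
document = {
    "id": "doc-1",
    "sections": [
        {
            "id": "sec-1",
            "title": "Cell Biology",
            "elements": [
                {"id": "tbl-1", "type": "table"},
                {"id": "fig-1", "type": "figure"},
            ],
            "relations": [
                # explicit cross-modal link between extracted entities
                {"from": "fig-1", "to": "tbl-1", "kind": "illustrates"},
            ],
        },
    ],
}


def element_ids(doc: dict) -> list[str]:
    """Flatten element ids; a duplicate would violate mutual exclusivity."""
    return [e["id"] for s in doc["sections"] for e in s["elements"]]
```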

Does your pipeline include capabilities for semantic structuring of the extracted content beyond basic layout analysis? That seems like the next frontier for maximizing the value of OCR data in ML training.

replies(1): >>43593169 #
ses425500000 No.43593169
Thanks for the insightful comment! You’re absolutely right — organizing extracted data into a coherent, semantically meaningful structure is critical for high-quality ML training.

Right now, the pipeline focuses on generating OCR outputs optimized for ML models by cleaning, deduplicating, and segmenting content across modalities (text, tables, figures, formulas). For diagrams and tables, we add semantic tags and preserve layout relationships to aid downstream modeling.
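(A minimal sketch of what the deduplication step amounts to, assuming a simple normalize-then-hash approach; this is my own illustration of the idea, not the repo's code.)

```python
# Hypothetical sketch: drop near-duplicate OCR segments by comparing
# a normalized form of each segment's text.
def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedupe_segments(segments: list[str]) -> list[str]:
    """Keep the first occurrence of each segment; skip empties."""
    seen: set[str] = set()
    kept: list[str] = []
    for seg in segments:
        key = normalize(seg)
        if key and key not in seen:
            seen.add(key)
            kept.append(seg)
    return kept
```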

I’m planning to add a semantic structuring module that goes beyond basic layout analysis — something that builds hierarchical, MECE-style representations and identifies entity relationships across sections. That’s absolutely the next frontier, and I really appreciate you pointing it out.

Thanks again for the thoughtful feedback!

replies(1): >>43594077 #
cAtte_ No.43594077
why are you using an LLM to reply to every comment?
replies(2): >>43594134 #>>43597481 #
ses425500000 No.43594134
Haha good catch! I’m 19 and from Korea, so I’ve been using an LLM to help with replies since my English isn’t perfect yet. But I designed and built the project myself (with help from some open models/tools) — just wanted to communicate more clearly with the community!
replies(1): >>43594839 #
gus_massa No.43594839
[Hi from Argentina!] LLMs have a particular style that will make people suspictious or even angry.

One possibility is to write the answer in Korean and use autotranslation (and post only the autotranslation). Double-check the technical terms, because autotranslation sometimes chooses the wrong synonym.

Another possibility is to write the answer in English inside Gmail; Gmail will highlight spelling and grammar errors so you can fix them.

Most people here will tolerate a few mistakes if the answer has your own personal style.

(Nice project, by the way.)

replies(1): >>43597153 #
vo2maxer No.43597153
Yes, writing that is suspictious makes me angry.
replies(1): >>43597994 #
gus_massa No.43597994
>> suspitious

:( My phone does not have orthography correction, and I didn't have my notebook.

Edit: fixed typo: gave -> have

replies(1): >>43598155 #
vo2maxer No.43598155
Por esa misma razón, un LLM te habría funcionado perfectamente: desplegando tus pensamientos tal como querías, pero sin las distracciones causadas por la mala ortografía o los errores gramaticales. Los LLM son herramientas —como bien sabes— que ya son esenciales y lo serán aún más con el paso del tiempo. Que algunos en esta plataforma se irriten por su uso solo significa que, eventualmente, se convertirán en los dinosaurios del futuro.

For that very reason, an LLM would have worked perfectly for you: laying out your thoughts just as you intended, but without the distractions caused by poor spelling or grammatical mistakes. LLMs are tools—as you well know—that are already essential and will become even more so over time. The fact that some people on this platform get irritated by their use just means they’ll eventually become the dinosaurs of the future.

replies(1): >>43601291 #
gus_massa No.43601291
This reads as es-es (perhaps es-es-corporate) instead of es-ar. I don't like "desplegando" because it's somewhat closer to "unfolding" than to "laying out". I'm not sure it's incorrect, but I'd have chosen differently.

The problem is that I read the emails from my friends using their voice and speaking style.

I'd do the same with HN comments, but I've never heard most (any?) of them. Anyway, each commenter has a personal style, or at least I keep an informal list in my head of a few hundred commenters. I remember that someone made a few good comments about some topic, so that adds weight in my mind to their opinion. I remember some details of their lives, like where they live, family, work, unusual past events, which topics they're interested in... they are persons!

With too much AI, comments get bland. They all read like the same corporate speak. AI would not add pasta recipes to antirez's comments, or yadayada to patio11's comments. Also, the topics on which I'd trust their opinions are very different.

I don't mind using AI to fix the text. Moreover, in one of my previous comments I recommended writing it in Gmail. I guess Gmail is using a mix of an expert system and modern AI. I hope someday Google adds that feature to the textbox in Chrome.

The problem is that some people are using AI to write short, "somewhat related" comments that are not wrong but not very relevant, and also giant "walls of text" that discuss the topic and its 5 most important ramifications. So there is an overreaction against correct orthography, grammar, and "AI style".

> The fact that some people on this platform get irritated by their use just means they’ll eventually become the dinosaurs of the future.

Remember that birds are dinosaurs. And if you think that nobody is scared of birds, you should visit a pen full of rheas (ostriches are a fine substitute). If you have any shiny ornament on your clothes they will try to eat it, and you will be hit by the beak. They will also steal food from your hands, and it hurts. We visited an open zoo with my older daughter when she was a kid. The rheas were locked inside a pen for security reasons, there were a lot of ducks and baby ducks that were cute, and the geese were scary because they are evil and come in organized groups to "ask" for food.