
170 points | ses425500000 | 2 comments

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:

- Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
- Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
- Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
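To make the multi-stage flow concrete, here's a rough sketch of how the stages fit together. This is a simplified illustration only: the function names, `Region` fields, and stubbed recognizers below are stand-ins of my own, not the project's actual API (the real pipeline calls out to DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision).

```python
# Illustrative sketch of the pipeline's shape; recognizers are stubbed.
from dataclasses import dataclass

@dataclass
class Region:
    kind: str      # "text", "table", "figure", or "formula"
    bbox: tuple    # (x0, y0, x1, y1) in page coordinates
    content: str   # recognized content (plain text, Markdown, or LaTeX)

def detect_layout(page_image) -> list[Region]:
    # Stand-in for the layout stage (DocLayout-YOLO-style):
    # returns typed regions with bounding boxes, content still empty.
    return [Region("text", (0, 0, 100, 20), ""),
            Region("formula", (0, 30, 100, 60), "")]

def recognize(region: Region, page_image) -> Region:
    # Stand-in for the recognition stage: route each region type to a
    # specialized recognizer (formulas -> math OCR, text -> general OCR,
    # figures/tables -> a vision-language model for description).
    stub_output = {"formula": "E = mc^2",
                   "text": "Sample sentence.",
                   "table": "| a | b |",
                   "figure": "Diagram of a cell."}
    return Region(region.kind, region.bbox, stub_output[region.kind])

def process_page(page_image) -> list[dict]:
    # Detect regions, recognize each, and emit JSON-ready records.
    regions = [recognize(r, page_image) for r in detect_layout(page_image)]
    return [{"type": r.kind, "bbox": r.bbox, "content": r.content}
            for r in regions]
```

The key design point is the routing step: each region type goes to the engine best suited for it, and everything converges on one structured record format for dataset building.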

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

jlcases No.43592934
This is a valuable contribution. The quality of ML models heavily depends on the quality of training data, and extracting structured information from unstructured documents (like PDFs) is a critical bottleneck.

A key challenge after OCR is organizing the extracted data into a coherent knowledge structure. We've seen significant improvements in downstream ML tasks when the extracted data is organized using a hierarchical, MECE (Mutually Exclusive, Collectively Exhaustive) framework. This ensures that relationships between entities (tables, diagrams, text) are explicitly captured.
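To sketch what I mean by a hierarchical, MECE-style organization (my own illustration, not any particular tool's format): each node's children partition its content without overlap, and flattening the tree makes every extracted entity's position in the hierarchy explicit for downstream training or RAG chunking.

```python
# Illustrative sketch: a document tree whose children are mutually
# exclusive and collectively exhaustive, flattened into labeled rows.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    content: str = ""
    children: list["Node"] = field(default_factory=list)

def flatten(node: Node, path=()) -> list[tuple]:
    """Return (section-path, content) pairs so each entity carries
    its full position in the hierarchy."""
    here = path + (node.label,)
    rows = [(here, node.content)] if node.content else []
    for child in node.children:
        rows += flatten(child, here)
    return rows

doc = Node("Chapter 1", children=[
    Node("1.1 Definitions", "A cell is ..."),
    Node("1.2 Figures", children=[Node("Fig. 1", "Diagram of mitosis")]),
])
```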

Does your pipeline include capabilities for semantic structuring of the extracted content beyond basic layout analysis? That seems like the next frontier for maximizing the value of OCR data in ML training.

replies(1): >>43593169 #
ses425500000 No.43593169
Thanks for the insightful comment! You’re absolutely right — organizing extracted data into a coherent, semantically meaningful structure is critical for high-quality ML training.

Right now, the pipeline focuses on generating OCR outputs optimized for ML models by cleaning, deduplicating, and segmenting content across modalities (text, tables, figures, formulas). For diagrams and tables, we add semantic tags and preserve layout relationships to aid downstream modeling.
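As a rough illustration of the cleaning step (a simplified stand-in I'm writing here, not the pipeline's actual code): deduplication normalizes each record's text and drops anything whose normalized form has already been seen, so near-identical OCR outputs don't pollute the training set.

```python
# Minimal sketch of normalize-then-dedupe over JSON-style OCR records.
import re

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial variants compare equal.
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(records: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for rec in records:
        key = (rec["type"], normalize(rec["content"]))
        if key not in seen:      # keep only the first occurrence
            seen.add(key)
            kept.append(rec)
    return kept
```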

I’m planning to add a semantic structuring module that goes beyond basic layout analysis — something that builds hierarchical, MECE-style representations and identifies entity relationships across sections. That’s absolutely the next frontier, and I really appreciate you pointing it out.

Thanks again for the thoughtful feedback!

replies(1): >>43594077 #
cAtte_ No.43594077
why are you using an LLM to reply to every comment?
replies(2): >>43594134 #>>43597481 #
vo2maxer No.43597481
Genuinely curious—could it be for the same reason you used a keyboard to write that comment? It’s efficient, it works. What’s the actual issue with using a tool that helps convey the intended message more clearly and quickly, as long as it reflects what he wanted to say?
replies(1): >>43598698 #
cAtte_ No.43598698
why are you offended on behalf of this person? knowing in hindsight that they're simply an English learner obviously makes me feel bad for asking the question, and i completely understand the use case. but i don't think it was unreasonable to suspect that someone who speaks entirely in ChatGPT paragraphs might be a bot, spammer, or the like—particularly because, in botnet fashion, the original reply was to a comment that also seemed to be LLM-authored
replies(1): >>43598982 #
vo2maxer No.43598982
I wasn't offended at all. I was just genuinely curious, because I keep coming across this assumption that if any text is well-crafted, it must have come from an LLM. I think I understand why: we've grown so used to reading sloppy writing, everything from barely coherent text messages to articles in reputable publications filled with typos and awkward phrasing.

Personally, I've always held myself to a high standard in how I write, even in text messages. Some might see that as bordering on perfectionism, but for me, it's about respecting the principle behind communication: to be as clear and correct as possible.

Now that we have tools that help ensure that clarity, or at the very least, reduce distractions caused by grammar or spelling mistakes, of course I'm going to use them. I used to agonize over my comments on Twitter because you couldn't edit them after posting. I would first write them elsewhere and review them several times for any errors before finally posting. For context: I'm a retired 69-year-old physician, and even after witnessing decades of technological advancement, I'm still in awe of what this new technology can do.

Yes, I love beautiful, natural writing. I'm a voracious reader of the great classics. I regularly immerse myself in Shakespeare, Hardy, Eliot, Dickens, Dostoyevsky, Austen, Tolstoy, and many other literary masters. But I also fully embrace this tool that can elevate even the clumsiest writer's work to a clarity we've never had access to before. If that comes at the cost of a bit of stylistic uniformity, that's a reasonable trade-off. It's up to the user to shape the output, review it, and make sure their own voice and ideas shine through.

Back to your original point, I truly wasn't offended on his behalf. I was just curious. As it turns out, he was using an LLM, because his native language is Korean. Good for him. And just to be clear, I didn't intend to make your question seem inappropriate or to embarrass him in any way. If it came across that way, I apologize.