
170 points by ses425500000 | 2 comments

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:

• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks
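To make the multi-stage routing concrete, here is a rough Python sketch of how such a pipeline can be wired together. It is not the repo's actual code: every helper (detect_layout, ocr_text, ocr_math, refine_with_llm) is a hypothetical stub standing in for the DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision calls, and the output schema is only illustrative.

    # Rough sketch of the multi-stage routing; backend calls are stubbed out.
    import json
    from dataclasses import dataclass, asdict

    @dataclass
    class Region:
        kind: str          # "text", "math", "table", or "figure"
        bbox: tuple        # (x0, y0, x1, y1) in page pixel coordinates
        content: str = ""

    def detect_layout(image_path: str) -> list[Region]:
        # Stand-in for a DocLayout-YOLO call returning typed regions.
        return [Region("text", (0, 0, 800, 200)), Region("math", (0, 220, 800, 400))]

    def ocr_text(image_path: str, bbox: tuple) -> str:
        # Stand-in for a Google Vision call on the cropped region.
        return "recognized multilingual text"

    def ocr_math(image_path: str, bbox: tuple) -> str:
        # Stand-in for a MathPix call returning LaTeX.
        return r"\int_0^1 x^2 \, dx"

    def refine_with_llm(region: Region) -> Region:
        # Stand-in for the Gemini pass; it only sees already-recognized content.
        return region

    def process_page(image_path: str) -> str:
        regions = detect_layout(image_path)
        for r in regions:
            r.content = ocr_math(image_path, r.bbox) if r.kind == "math" else ocr_text(image_path, r.bbox)
        refined = [refine_with_llm(r) for r in regions]
        return json.dumps([asdict(r) for r in refined], ensure_ascii=False, indent=2)

    if __name__ == "__main__":
        print(process_page("sample_page.png"))

The key design point this illustrates is the ordering: recognition happens first, region by region, and the LLM only post-processes text that has already been extracted.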

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

bonoboTP No.43594848
Using LLMs for OCR is super risky because, just as much as they can fix OCR mistakes, they can inadvertently "fix" correct text too and hallucinate instead.

It's that Xerox bug on steroids, where scanned pages would get their digits swapped for other digits...

I'd want to see some proper hallucination analysis.
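For concreteness, a minimal sketch of what such a hallucination analysis could look like, assuming ground-truth transcriptions exist for a test set; the function names and the digit-focused report are illustrative, not taken from the project:

    # Compare raw OCR output and LLM-corrected output against a ground-truth
    # transcription, so genuine fixes and introduced errors show up separately.
    # Stdlib-only sketch: edit_distance() is plain Levenshtein, and
    # digit_changes() flags the classic digit-swap failure mode.
    import difflib

    def edit_distance(a: str, b: str) -> int:
        # Standard dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def cer(hyp: str, ref: str) -> float:
        # Character error rate relative to the reference text.
        return edit_distance(hyp, ref) / max(len(ref), 1)

    def digit_changes(raw: str, corrected: str) -> list[tuple[str, str]]:
        # Replacements that touch digits, i.e. the "swapped digits" scenario.
        changes = []
        matcher = difflib.SequenceMatcher(None, raw, corrected)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace" and (any(c.isdigit() for c in raw[i1:i2])
                                    or any(c.isdigit() for c in corrected[j1:j2])):
                changes.append((raw[i1:i2], corrected[j1:j2]))
        return changes

    def report(raw: str, corrected: str, truth: str) -> dict:
        return {
            "cer_raw": cer(raw, truth),              # error rate before the LLM pass
            "cer_corrected": cer(corrected, truth),  # error rate after the LLM pass
            "digit_changes": digit_changes(raw, corrected),
        }

    if __name__ == "__main__":
        # Here the LLM "corrects" an already-correct digit: CER goes up, not down.
        print(report(raw="Total: 1,024 yen",
                     corrected="Total: 1,029 yen",
                     truth="Total: 1,024 yen"))

Reporting the two error rates side by side, plus a list of the specific characters the LLM changed, makes it visible whether the correction pass is actually helping or just rewriting.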

replies(3): >>43595828 >>43598188 >>43598881
1. ses425500000 No.43598188
Yeah, the hallucination part was also one of the things I was worried about. So I made the LLM run only after the OCR step, and I added a simple check so it doesn't change text that is already correct. I will try to show real examples and a hallucination rate too. Thanks for the feedback!
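A minimal sketch of what that kind of post-LLM guard could look like (hypothetical; the actual check in the repo may differ, and the digit rule and threshold here are just examples):

    # The LLM suggestion is accepted only if it leaves digits untouched and
    # rewrites at most a small fraction of the string; otherwise the raw OCR
    # text is kept as-is.
    import difflib

    def accept_llm_correction(ocr_text: str, llm_text: str,
                              max_change_ratio: float = 0.1) -> str:
        # Never let the LLM pass rewrite numbers.
        if [c for c in ocr_text if c.isdigit()] != [c for c in llm_text if c.isdigit()]:
            return ocr_text
        # Bound how much of the string the LLM is allowed to change.
        similarity = difflib.SequenceMatcher(None, ocr_text, llm_text).ratio()
        if 1.0 - similarity > max_change_ratio:
            return ocr_text
        return llm_text

    # A small typo fix passes; a digit flip falls back to the raw OCR output.
    print(accept_llm_correction("Totai: 1,024 yen", "Total: 1,024 yen"))  # -> "Total: 1,024 yen"
    print(accept_llm_correction("Total: 1,024 yen", "Total: 1,029 yen"))  # -> "Total: 1,024 yen"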

This project was just a hobby and this is my first time posting something. I didn't imagine people would care this much… Next time I will prepare better before sharing.

replies(1): >>43601113
2. bonoboTP No.43601113
I didn't mean to target you specifically, just the general idea/trend of applying "smart priors" to OCR. That is, a system that has a concept of what's plausible and may make the content more "plausible" instead of accurate. For example, an OCR system should be required to recognize characters exactly, one by one, even including the typos. Sometimes even the presence of a comma or a small spelling variation can have significance. Or imagine running financial accounting material through LLM-OCR. And if you ask why you would OCR that instead of keeping digital records -- well, the real world can be very unreasonable and incompetent, and there are cases where, e.g., the government only releases scanned PDFs on official sites regarding financial audit statistics, etc.