(github.com)

170 points ses425500000 | 1 comments | 05 Apr 25 05:22 UTC

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features:

• Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision
• Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English)
• Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.). Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program


bonoboTP ◴[05 Apr 25 16:57 UTC] No.43594848[source]▶

>>43590998 (OP) #

Using LLMs for OCR is super risky because, just as they can fix OCR mistakes, they can also inadvertently "fix" correct text and hallucinate instead.

It's that Xerox bug on steroids, where scanned pages would get their digits swapped for other digits...

I'd want to see some proper hallucination analysis.
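One narrow piece of such an analysis can be sketched as follows: extract the numbers from the raw OCR text and from the LLM-corrected text, then list every place where they differ. This is a minimal illustration, not part of the posted pipeline; the function name `changed_numbers` and the number pattern are assumptions made for the example. A flagged change is not necessarily a hallucination (the raw OCR may have been the wrong one), so the output is meant for human review.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern: integers and decimals such as 1986, 3.5 or 1,000.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def changed_numbers(raw_text, corrected_text):
    """Return (raw_numbers, corrected_numbers) pairs that differ between the texts."""
    raw = NUMBER.findall(raw_text)
    fixed = NUMBER.findall(corrected_text)
    # Align the two number sequences so an inserted or dropped number
    # does not make every later number look changed.
    matcher = SequenceMatcher(a=raw, b=fixed, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((raw[i1:i2], fixed[j1:j2]))
    return changes


if __name__ == "__main__":
    raw = "Total: 1986 units at 3.5 kg"
    corrected = "Total: 1966 units at 3.5 kg"
    for before, after in changed_numbers(raw, corrected):
        print("review:", before, "->", after)
```

Running a check like this over a whole corpus gives a count of numeric edits made by the LLM stage, which can then be compared against the source scans.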

replies(3): >>43595828 #>>43598188 #>>43598881 #

1. fnordpiglet ◴[06 Apr 25 04:04 UTC] No.43598881[source]▶

>>43594848 #

I use Tesseract, which uses an LSTM-based OCR engine, along with multimodal LLMs to converge on a ground truth. It works remarkably well. However, for my purposes I don’t want an LLM explaining charts; I want it to produce a vector format of the chart. There are a few models that produce LaTeX chart formats that I’m experimenting with:

https://arxiv.org/pdf/2405.15306
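The idea of letting several engines converge on one reading can be illustrated with a simple majority vote over candidate transcriptions of the same line. This is only a sketch under stated assumptions, not the commenter's actual method: the function name `consensus`, the `min_votes` threshold, and the whitespace normalization are all hypothetical choices made for the example.

```python
from collections import Counter


def consensus(candidates, min_votes=2):
    """Pick the transcription most engines agree on, or None if agreement is too weak."""
    # Collapse whitespace so spacing differences do not split the vote.
    normalized = [" ".join(c.split()) for c in candidates]
    if not normalized:
        return None
    text, votes = Counter(normalized).most_common(1)[0]
    # Returning None signals that the line should go to manual review.
    return text if votes >= min_votes else None


if __name__ == "__main__":
    # Hypothetical outputs from three different engines for one line.
    print(consensus(["E = mc^2", "E =  mc^2", "E = mc^3"]))
```

Voting on whole lines is deliberately crude; lines with no agreement are returned as `None` so they can be routed to review rather than silently "fixed".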

Most OCR pipelines like this, along with excellent commercial ones like doctly.ai, are focused on OCR for LLM consumption. I’d like instead to be able to recreate, in modern typesetting, original scientific work that predates digital typesetting: for LLMs, yes, but also to preserve and promote the science of yore, much of which includes discoveries that are forgotten but still relevant to problems we face today.


Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)