(github.com)

170 points ses425500000 | 2 comments | 05 Apr 25 05:22 UTC | HN request time: 0.412s | source

Hi HN,

I’ve been working on an OCR pipeline specifically optimized for machine learning dataset preparation. It’s designed to process complex academic materials — including math formulas, tables, figures, and multilingual text — and output clean, structured formats like JSON and Markdown.

Some features: • Multi-stage OCR combining DocLayout-YOLO, Google Vision, MathPix, and Gemini Pro Vision • Extracts and understands diagrams, tables, LaTeX-style math, and multilingual text (Japanese/Korean/English) • Highly tuned for ML training pipelines, including dataset generation and preprocessing for RAG or fine-tuning tasks

Sample outputs and real exam-based examples are included (EJU Biology, UTokyo Math, etc.) Would love to hear any feedback or ideas for improvement.

GitHub: https://github.com/ses4255/Versatile-OCR-Program

Show context

bonoboTP ◴[05 Apr 25 16:57 UTC] No.43594848[source]▶

>>43590998 (OP) #

LLMs for OCR is super risky because just as much as they can fix OCR mistakes, they can inadvertently "fix" correct stuff too and hallucinate instead.

Its that xerox bug on steroids, where scanned pages would get their digits swapped by other digits...

I'd want to see some proper hallucination analysis.

replies(3): >>43595828 #>>43598188 #>>43598881 #

sureglymop ◴[05 Apr 25 18:52 UTC] No.43595828[source]▶

>>43594848 #

Also, what about prompt injection? With an LLM as far as I'm aware there is never a clear separation between instruction and the data to be processed.

replies(1): >>43598197 #

1. ses425500000 ◴[06 Apr 25 01:22 UTC] No.43598197[source]▶

>>43595828 #

Yeah, prompt injection is good point. For now, I try separate instruction and data by using JSON format, and run it in sandbox. Not perfect maybe, but I will try add small explanation in README so people can check it better.

replies(1): >>43613792 #

2. sureglymop ◴[07 Apr 25 17:21 UTC] No.43613792[source]▶

>>43598197 (TP) #

In this case the result/output is plain text. Since it's not code it may be harder to imagine an attack vector. As an attacker, here would be some of my capabilities/possibilities:

- I could change the meaning of the output and the output entirely. - If I can control one part of a larger set of data that is analyzed , I could influence the whole output. - I could try to make the process take forever in order to waste resources.

I'd say the first scenario is most interesting, especially if I could then potentially also influence how an LLM trained on the output behaves and do even more damage using this down the line.

Let's say I'm a disgruntled website author. I want my users to see correct information on my website but don't want any LLM to be trained on it. In this case I could probably successfully use prompt injection to "poison" the model.

↑

Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)