
DeepSeek OCR

(github.com)
990 points | pierre | 1 comment
modeless ◴[] No.45644207[source]
Hmm, at first I was thinking "why OCR?", but maybe the reason is to ingest more types of training data for LLM improvement, e.g. scanned academic papers? I imagine all the frontier labs have a solution for this due to the value of academic papers as a data source.

Edit: Oh I see the paper abstract says this explicitly: "In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G)". This is just part of the training data ingestion pipeline for their real models. Explains why the architecture is not using all of their latest tricks: it's already good enough for their use case and it's not the main focus.
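As a quick sanity check on that quoted figure, 200k+ pages per day on a single GPU works out to only a couple of pages per second. A back-of-envelope conversion (the daily figure comes from the quoted abstract; the rest is just arithmetic):

```python
# Back-of-envelope: the abstract's claimed 200k+ pages/day on one A100-40G,
# converted to a per-second throughput rate.
PAGES_PER_DAY = 200_000          # figure quoted from the DeepSeek-OCR abstract
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY
print(f"{pages_per_second:.2f} pages/sec")  # roughly 2.31 pages/sec
```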

replies(1): >>45658619 #
1. polytely ◴[] No.45658619[source]
If we get OCR working, it becomes possible to store all the human knowledge currently locked up in PDFs with far fewer resources.

https://annas-archive.org/blog/critical-window.html
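To give a rough sense of the resource savings, here is an illustrative comparison of storing a page as a scanned image versus as OCR-extracted text. Both sizes are assumptions for the sake of the sketch, not measurements:

```python
# Illustrative only: assumed average sizes (not measured) for one book page
# stored as a compressed page scan vs. as OCR-extracted plain text.
SCAN_KB = 100   # assumption: typical JPEG scan of one page
TEXT_KB = 3     # assumption: ~3,000 characters of extracted UTF-8 text

ratio = SCAN_KB / TEXT_KB
print(f"~{ratio:.0f}x smaller as text")  # ~33x under these assumptions
```

The exact ratio depends heavily on scan resolution and compression, but the text representation is smaller by an order of magnitude or more.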