
DeepSeek OCR (github.com)

990 points by pierre | 2 comments
1. modeless No.45644207
Hmm, at first I was thinking "why OCR?", but maybe the reason is to ingest more types of training data for LLM improvement, e.g. scanned academic papers? I imagine all the frontier labs have a solution for this due to the value of academic papers as a data source.

Edit: Oh I see the paper abstract says this explicitly: "In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G)". This is just part of the training data ingestion pipeline for their real models. Explains why the architecture is not using all of their latest tricks: it's already good enough for their use case and it's not the main focus.
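
A rough sketch of what that kind of ingestion step might look like, assuming a generic image-to-text OCR checkpoint behind Hugging Face's `pipeline` API; the model name and the pdf2image/poppler rasterization step are stand-ins for illustration, not DeepSeek-OCR's actual interface:

    # Illustrative page-level OCR ingestion loop (not DeepSeek-OCR's real API).
    # Shape of the pipeline: rasterize PDF pages -> OCR -> plain-text corpus.
    from pathlib import Path

    from pdf2image import convert_from_path   # rasterizes PDF pages to PIL images
    from transformers import pipeline          # generic image-to-text pipeline

    # Any vision-to-text OCR checkpoint could be dropped in here; TrOCR is used
    # purely as a stand-in that works with the standard image-to-text pipeline.
    ocr = pipeline("image-to-text", model="microsoft/trocr-base-printed")

    def ingest_pdf(pdf_path: Path, out_dir: Path) -> None:
        """OCR every page of one scanned PDF into a plain-text training document."""
        out_dir.mkdir(parents=True, exist_ok=True)
        pages = convert_from_path(str(pdf_path), dpi=200)
        text_pages = []
        for image in pages:
            result = ocr(image)                # [{"generated_text": "..."}]
            text_pages.append(result[0]["generated_text"])
        (out_dir / f"{pdf_path.stem}.txt").write_text("\n\n".join(text_pages))

    if __name__ == "__main__":
        for pdf in Path("scanned_papers").glob("*.pdf"):
            ingest_pdf(pdf, Path("corpus"))

The 200k+ pages/day figure from the abstract is about batching a loop like this across a GPU; the sketch only shows the per-document shape, not the scheduling.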

replies(1): >>45658619 #
2. polytely No.45658619
If we get OCR working well, it becomes possible to store all the human knowledge currently locked up in PDFs with far fewer resources.

https://annas-archive.org/blog/critical-window.html
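
Back-of-the-envelope, with purely illustrative per-page sizes (assumptions, not measurements), the savings being pointed at look roughly like this:

    # All numbers are illustrative assumptions, not measurements.
    scan_kb_per_page = 100   # assumed ~100 KB for a compressed page scan
    text_kb_per_page = 3     # assumed ~3 KB of UTF-8 text per page after OCR
    pages = 300              # a typical book

    print(f"scanned book: {scan_kb_per_page * pages / 1024:.1f} MB")
    print(f"OCR'd text:   {text_kb_per_page * pages / 1024:.1f} MB")
    print(f"savings:      ~{scan_kb_per_page / text_kb_per_page:.0f}x fewer bytes")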