DeepSeek OCR

(github.com)
990 points pierre | 4 comments
ellisd ◴[] No.45641234[source]
The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their collection of 7.5 million Chinese non-fiction books (350 TB), which is bigger than Library Genesis.

https://annas-archive.org/blog/duxiu-exclusive.html

replies(5): >>45641927 #>>45642797 #>>45642836 #>>45643509 #>>45644415 #
1. ikamm ◴[] No.45642836[source]
Why do they need to grant access for people to use copies of books they don’t own?
replies(2): >>45643228 #>>45653004 #
2. JohnLocke4 ◴[] No.45643228[source]
Not to rationalize it, but it appears that they're gatekeeping the dataset so that they get the OCR scans back from the people they choose to share it with. This would improve their existing service by making the content of books (and not just their titles and tags) searchable.

As per the blog post:

> What does Anna’s Archive get out of it? Full-text search of the books for its users.

replies(1): >>45644161 #
3. ikamm ◴[] No.45644161[source]
Fair enough, it just seems like they're painting an even bigger target on their backs by restricting access to copyrighted material they don't own the rights to.
4. est ◴[] No.45653004[source]
> The books from Duxiu have long been pirated on the Chinese internet. Usually they are being sold for less than a dollar by resellers. They are typically distributed using the Chinese equivalent of Google Drive, which has often been hacked to allow for more storage space

Ownership laundering.