DeepSeek OCR | slacker news

1. ellisd ◴[20 Oct 25 08:22 UTC] No.45641234[source]▶

The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their 7.5 million-book (350 TB) Chinese non-fiction collection, which is bigger than Library Genesis.

https://annas-archive.org/blog/duxiu-exclusive.html

replies(5): >>45641927 #>>45642797 #>>45642836 #>>45643509 #>>45644415 #

2. throawayonthe ◴[20 Oct 25 09:41 UTC] No.45641927[source]▶

>>45641234 (TP) #

Hahaha, I also immediately thought of this. I wonder when the OCR'd dataset will be released.

3. singularfutur ◴[20 Oct 25 11:43 UTC] No.45642797[source]▶

>>45641234 (TP) #

Yes it means they will never release their dataset :(

4. ikamm ◴[20 Oct 25 11:48 UTC] No.45642836[source]▶

>>45641234 (TP) #

Why do they need to grant access for people to use copies of books they don’t own?

replies(2): >>45643228 #>>45653004 #

5. JohnLocke4 ◴[20 Oct 25 12:39 UTC] No.45643228[source]▶

>>45642836 #

Not to rationalize it, but it appears that they're gatekeeping the dataset to get access to the OCR scans from the people they choose to share it with. This is to improve their existing service by making the content of books (and not just their titles/tags) searchable.

As per the blog post:

> What does Anna’s Archive get out of it? Full-text search of the books for its users.

replies(1): >>45644161 #

6. dev1ycan ◴[20 Oct 25 13:13 UTC] No.45643509[source]▶

>>45641234 (TP) #

Oh great, so now Anna's Archive will get taken down as well by another trash LLM provider abusing repositories that students and researchers use. Meta torrenting 70TB from Library Genesis wasn't enough.

replies(4): >>45643563 #>>45643595 #>>45643640 #>>45643646 #

7. c0balt ◴[20 Oct 25 13:22 UTC] No.45643595[source]▶

>>45643509 #

It appears this is an active offer from Anna's archive, so presumably they can handle the load and are able to satisfy the request safely.

8. ◴[20 Oct 25 13:26 UTC] No.45643640[source]▶

>>45643509 #

9. sigmoid10 ◴[20 Oct 25 13:27 UTC] No.45643646[source]▶

>>45643509 #

Seems like they are doing fine:

https://open-slum.org

replies(1): >>45667295 #

10. ikamm ◴[20 Oct 25 14:15 UTC] No.45644161{3}[source]▶

>>45643228 #

Fair enough, it just seems like they're painting an even bigger target on their backs by restricting access to copyrighted material they don't own the rights to

11. bluecoconut ◴[20 Oct 25 14:38 UTC] No.45644415[source]▶

>>45641234 (TP) #

A previous DeepSeek paper did mention Anna’s Archive:

> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions.

DeepSeek-VL paper: https://arxiv.org/abs/2403.05525

12. est ◴[21 Oct 25 06:22 UTC] No.45653004[source]▶

>>45642836 #

> The books from Duxiu have long been pirated on the Chinese internet. Usually they are being sold for less than a dollar by resellers. They are typically distributed using the Chinese equivalent of Google Drive, which has often been hacked to allow for more storage space

Ownership laundering.

13. dev1ycan ◴[22 Oct 25 10:58 UTC] No.45667295{3}[source]▶

>>45643646 #

Yeah, for now. Meta torrented 70TB, and right after that they cut the rope for everyone else; mysteriously, their hitman (the US government) hit both Libgen and Z-Lib shortly after.