(github.com)

990 points pierre | 1 comment | 20 Oct 25 06:26 UTC

ellisd ◴[20 Oct 25 08:22 UTC] No.45641234[source]▶

The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their collection of 7.5 million Chinese non-fiction books (350 TB) ... which is bigger than Library Genesis.

https://annas-archive.org/blog/duxiu-exclusive.html

1. bluecoconut ◴[20 Oct 25 14:38 UTC] No.45644415[source]▶

>>45641234 #

A previous paper from DeepSeek did mention Anna’s Archive.

> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions.

DeepSeek-VL paper: https://arxiv.org/abs/2403.05525

DeepSeek OCR