    DeepSeek OCR

    (github.com)
    990 points pierre | 13 comments
    1. ellisd ◴[] No.45641234[source]
    The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their Chinese non-fiction collection of 7.5 million books (350 TB), which is bigger than Library Genesis.

    https://annas-archive.org/blog/duxiu-exclusive.html

    replies(5): >>45641927 #>>45642797 #>>45642836 #>>45643509 #>>45644415 #
    2. throawayonthe ◴[] No.45641927[source]
    hahaha also immediately thought of this, wonder when the ocr'd dataset would be getting released
    3. singularfutur ◴[] No.45642797[source]
    Yes it means they will never release their dataset :(
    4. ikamm ◴[] No.45642836[source]
    Why do they need to grant access for people to use copies of books they don’t own?
    replies(2): >>45643228 #>>45653004 #
    5. JohnLocke4 ◴[] No.45643228[source]
    Not to rationalize it, but it appears that they're gatekeeping the dataset to get access to the OCR-scans from the people they choose to share it with. This is to improve their existing service by making the content of books (and not just their title/tags) searchable.

    As per the blog post: >What does Anna’s Archive get out of it? Full-text search of the books for its users.

    replies(1): >>45644161 #
    6. dev1ycan ◴[] No.45643509[source]
    Oh great, so now Anna's Archive will get taken down as well by another trash LLM provider abusing repositories that students and researchers use. Meta torrenting 70TB from Library Genesis wasn't enough.
    replies(4): >>45643563 #>>45643595 #>>45643640 #>>45643646 #
    7. c0balt ◴[] No.45643595[source]
    It appears this is an active offer from Anna's archive, so presumably they can handle the load and are able to satisfy the request safely.
    8. ◴[] No.45643640[source]
    9. sigmoid10 ◴[] No.45643646[source]
    Seems like they are doing fine:

    https://open-slum.org

    replies(1): >>45667295 #
    10. ikamm ◴[] No.45644161{3}[source]
    Fair enough, it just seems like they're painting an even bigger target on their backs by restricting access to copyrighted material they don't own the rights to
    11. bluecoconut ◴[] No.45644415[source]
    A previous DeepSeek paper did mention Anna’s Archive:

    > We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions.

    DeepSeek-VL paper: https://arxiv.org/abs/2403.05525

    12. est ◴[] No.45653004[source]
    > The books from Duxiu have long been pirated on the Chinese internet. Usually they are being sold for less than a dollar by resellers. They are typically distributed using the Chinese equivalent of Google Drive, which has often been hacked to allow for more storage space

    Ownership laundering.

    13. dev1ycan ◴[] No.45667295{3}[source]
    Yeah, for now. Meta torrented 70TB and right after that they cut the rope for everyone else. Mysteriously, their hitman (the US government) hit both Libgen and Z-Lib shortly after.