    262 points rain1 | 30 comments
    1. mjburgess ◴[] No.44442335[source]
    DeepSeek V3 is ~671B parameters, which is ~1.4TB physical (at 2 bytes per parameter).

    All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all English electronic text publicly available would be on O(100TB). So we're at about 1% of that in model size, and we're in a diminishing-returns area of training -- i.e., going to >1% has not yielded improvements (cf. gpt4.5 vs 4o).

    This is why compute spend is moving to inference time with "reasoning" models. It's likely we're close to diminishing returns on inference-time compute now too, hence agents, whereby (mostly) deterministic tools supply additional information/capability into the system.

    I think to get any more value out of this model class, we'll be looking at domain-specific specialisation beyond instruction fine-tuning.

    I'd guess 1TB of inference-time VRAM is a reasonable medium-term target for high-quality open-source models -- that's within reach of most SMEs today. That's about 250B params.
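
    A rough sketch of that arithmetic, assuming ~2 bytes/param for bf16 weights; the 250B figure is read here as a ~4 bytes/param budget (fp32, or fp16 plus headroom for KV cache etc.), which is an assumption on my part:

        params = 671e9                       # ~670B total parameters
        weight_bytes = params * 2            # assume bf16/fp16 weights, 2 bytes each
        print(weight_bytes / 1e12)           # ~1.34 TB "physical"

        corpus_bytes = 100e12                # the ~100TB public-English-text estimate above
        print(weight_bytes / corpus_bytes)   # ~0.013, i.e. on the order of 1%

        vram_bytes = 1e12                    # 1TB inference-time VRAM
        print(vram_bytes / 4 / 1e9)          # ~250 (billion params) at a ~4 bytes/param budget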

    replies(9): >>44442404 #>>44442633 #>>44442696 #>>44443009 #>>44443088 #>>44443188 #>>44443289 #>>44444740 #>>44449842 #
    2. account-5 ◴[] No.44442404[source]
    > All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all English electronic text publicly available would be on O(100TB).

    Where are you getting these numbers from? I'm interested to see how that's calculated.

    I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx. 50MB. (I might be misquoting it, as I no longer have the source.)

    replies(6): >>44442434 #>>44442485 #>>44442551 #>>44442770 #>>44443245 #>>44462214 #
    3. WesolyKubeczek ◴[] No.44442434[source]
    Maybe prior to the prior century, and even then I smell a lot of bullshit. I mean, just look at Project Gutenberg. Even plaintext only, even compressed.
    replies(1): >>44443280 #
    4. kmm ◴[] No.44442485[source]
    Perhaps that's meant to be 50GB (and that still seems like a serious underestimation)? Just the Bible is already 5MB.
    replies(1): >>44442513 #
    5. _Algernon_ ◴[] No.44442513{3}[source]
    English Wikipedia without media alone is ~24 GB compressed.

    https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

    replies(1): >>44444121 #
    6. mjburgess ◴[] No.44442551[source]
    Anna's Archive full torrent is O(1PB), Project Gutenberg is O(1TB), many AI training torrents are reported in the O(50TB) range.

    Extract just the plain text from that (plus social media, etc.), remove symbols outside a 64-symbol alphabet (6 bits), and compress. "Feels" to me like around 100TB max for absolutely everything.

    Either way, full-fat LLMs are operating at 1-10% of this scale, depending how you want to estimate it.

    If you run a more aggressive filter on that 100TB, e.g. a more semantic dedup, there's a plausible argument for the "information" in available English texts being ~10TB -- then we're running close to 20% of that in LLMs.

    If we take LLMs to just be that "semantic compression algorithm", and supposing the maximum useful size of an LLM is 2TB, then you could run the argument that everything "salient" ever written is <10TB.

    Taking LLMs to be running at close to 50% of "everything useful", rather than 1%, would be an explanation of why training has capped out.

    I think the issue is at least as much to do with what we're using LLMs for -- i.e., instruction fine-tuning requires some more general (proxy/quasi-) semantic structures in LLMs, and I think you only need O(1%) of "everything ever written" to capture these. So it wouldn't really matter how much more we added; instruction-following LLMs don't really need it.
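
    A hand-wavy sketch of the packing/dedup arithmetic above, where every ratio is a guess of mine rather than a measured number:

        raw_tb = 1000              # O(1PB) of raw archives (Anna's Archive etc.)
        text_fraction = 0.3        # guess: what survives plain-text extraction
        pack_ratio = 6 / 8         # 64-symbol alphabet: 6 bits per character instead of 8
        zip_ratio = 0.5            # guess: generic compression on top of packing

        usable_tb = raw_tb * text_fraction * pack_ratio * zip_ratio
        print(usable_tb)           # ~112 TB, i.e. "around 100TB max"

        dedup_ratio = 0.1          # guess: aggressive semantic dedup
        print(usable_tb * dedup_ratio)  # ~11 TB of "information"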

    7. smokel ◴[] No.44442633[source]
    Simply add images and video, and these estimates start to sound like "640 KB ought to be enough for anybody".

    After that, make the robots explore and interact with the world by themselves, to fetch even more data.

    In all seriousness, adding image and interaction data will probably be enormously useful, even for generating text.

    replies(1): >>44443244 #
    8. generalizations ◴[] No.44442696[source]
    > has not yielded improvements (cf. gpt4.5 vs 4o).

    FWIW there is a huge difference between 4.5 and 4o.

    9. TeMPOraL ◴[] No.44442770[source]
    > I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx. 50MB. (I might be misquoting it, as I no longer have the source.)

    50 MB feels too low, unless the quote meant text up until the 20th century, in which case it feels much more believable. In terms of text production and publishing, we're still riding an exponent, so a couple orders of magnitude increase between 1899 and 2025 is not surprising.

    (Talking about S-curves is all the hotness these days, but I feel it's usually a way to avoid understanding what exponential growth means - if one assumes we're past the inflection point, one can wave their hands and pretend the change is linear, and continue to not understand it.)

    replies(2): >>44443753 #>>44443771 #
    10. charcircuit ◴[] No.44443009[source]
    >The public web is ~50TB

    Did you mean to type EB?

    replies(1): >>44443291 #
    11. andrepd ◴[] No.44443088[source]
    > 50TB

    There's no way the entire Web fits in $400 worth of hard drives.

    replies(2): >>44443160 #>>44443206 #
    12. AlienRobot ◴[] No.44443160[source]
    Text is small.
    13. fouc ◴[] No.44443188[source]
    Maybe you're thinking of the Library of Congress when you say ~50TB? The internet is definitely larger.
    replies(1): >>44449735 #
    14. flir ◴[] No.44443206[source]
    Nah, Common Crawl puts out ~250TB a month.

    Maybe text only, though...

    15. netcan ◴[] No.44443244[source]
    Likely both will be done. Idk what the ROI is on adding video data to the text models, but it's presumably lower than for text.

    There are just a lot of avenues to try at this point.

    replies(1): >>44446677 #
    16. bravesoul2 ◴[] No.44443245[source]
    I reckon a prolific writer could publish a million words in their career.

    Most people who blog could write 1k words a day. That's a million in 3 years. So not crazy numbers here.

    That's 5MB. Maybe you meant 50GB. I'd hazard 50TB.
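
    The arithmetic, assuming roughly 5 bytes per English word of plain ASCII:

        words_per_day = 1000
        words = words_per_day * 365 * 3       # ~1.1 million words over 3 years
        bytes_per_word = 5                    # rough average word length plus a space
        print(words * bytes_per_word / 1e6)   # ~5.5 MB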

    17. bravesoul2 ◴[] No.44443280{3}[source]
    Even Shakespeare alone needs 4 floppy disks.
    18. rain1 ◴[] No.44443289[source]
    This is kind of related to the Jack Morris post https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only where he discusses how the big leaps in LLMs have come not so much from new training methods or architecture changes as such, but from the ability of new architectures to ingest more data.
    19. gosub100 ◴[] No.44443291[source]
    Only if you included all images and video.
    20. ben_w ◴[] No.44443753{3}[source]
    Even by the start of the 20th century, 50 MB is definitely far too low.

    Any given English translation of the Bible is by itself something like 3-5 megabytes of ASCII; the complete works of Shakespeare are about 5 megabytes; and I think (back of the envelope estimate) you'd get about the same again for what Arthur Conan Doyle wrote before 1900.

    I can just about believe there might have been only ten thousand Bible-or-Shakespeare sized books (plus all the court documents, newspapers, etc. that add up to that) worldwide by 1900, but not ten.

    Edit: I forgot about encyclopaedias; by 1900 the Encyclopædia Britannica was almost certainly more than 50 MB all by itself.

    replies(1): >>44454930 #
    21. jerf ◴[] No.44443771{3}[source]
    50MB feels like "all the 'ancient' text we have" maybe, as measured by the size of the original content and not counting copies. A quick check of Alice in Wonderland puts it at 163kB in plain text. About 300 of those gets us to 50MB. There are way more than 300 books of similar size from the 19th century. They may not all be digitized and freely available, but you can fill libraries with even existing 19th century texts, let alone what may be lost by now.

    Or it may just be someone bloviating and being wrong... I think even ancient texts could exceed that number, though perhaps not by an order of magnitude.

    22. kmm ◴[] No.44444121{4}[source]
    I don't see how the size of Wikipedia has any bearing on the 50MB figure given for pre-20th century literature by the parent.
    23. layer8 ◴[] No.44444740[source]
    Just a nitpick, but please don’t misuse big O notation like that. Any fixed storage amount is O(100TB).
    24. llSourcell ◴[] No.44446677{3}[source]
    No, it's not lower than text; it's higher ROI than text for understanding the physics of the world, which is exactly what video is better at than text when it comes to training data.
    replies(1): >>44448134 #
    25. AstroBen ◴[] No.44448134{4}[source]
    Does that transfer, though? I'm not sure we can expect that its ability to approximate physics in video form would transfer to any other mode (text, code, problem solving, etc.).
    replies(1): >>44448595 #
    26. ricopags ◴[] No.44448595{5}[source]
    Depends on the hyperparams, but one of the biggest benefits of a latent space is transfer between modalities.
    27. Aachen ◴[] No.44449735[source]
    Indeed, a quick lookup doesn't give many reliable-sounding sources, but they're all on the order of zettabytes (tens to thousands of them), also for years before any LLM was halfway usable. One has to wonder how much of that is generated; I'm thinking of some of my own websites, where the pages are statistics derived from player highscores, or of the websites that jokingly index all Bitcoin addresses and UUIDs.

    Perhaps the 50TB estimate is unique information without any media or the like, but OP can better back up where they got that number from than I can with guesswork.

    28. camel-cdr ◴[] No.44449842[source]
    > All digitized books ever written/encoded compress to a few TB.

    I tried to estimate how much data this actually is:

        # annas archive stats
        papers = 105714890
        books = 52670695
        
        # word count estimates
        avrg_words_per_paper = 10000
        avrg_words_per_book = 100000
        
        words = (papers*avrg_words_per_paper + books*avrg_words_per_book )
        
        # quick test: a sample of ~27 million words from a few books
        sample_words = 27809550
        sample_bytes = 158824661
        sample_bytes_comp = 28839837 # using zpaq -m5
        
        bytes_per_word = sample_bytes/sample_words
        byte_comp_ratio = sample_bytes_comp/sample_bytes
        word_comp_ratio = bytes_per_word*byte_comp_ratio
        
        print("total:", words*bytes_per_word*1e-12, "TB") # total: 30.10238345855199 TB
        print("compressed:", words*word_comp_ratio*1e-12, "TB") # compressed: 5.466077036085319 TB
    
    
    So uncompressed ~30 TB and compressed ~5.5 TB of data.

    That fits on three 2TB microSD cards, which you could buy for a total of $750 from SanDisk.

    29. TeMPOraL ◴[] No.44454930{4}[source]
    You and 'jerf make a fair point. Assuming you both are right, let's take jerf's estimate (which I now feel is right):

    > 50MB feels like "all the 'ancient' text we have" maybe, as measured by the size of the original content and not counting copies

    and yours - counting up court documents, newspapers, encyclopaedias, and I guess I'd add various letters to it (quite a lot survived to this day), and science[0], let's give it 1000x my estimate, so 50GB.

    For the present, comments upthread give estimates that are in the hundreds-of-terabytes to petabyte range. I'd say that, including deduplication, 50TB would be a conservative value. That's still 1000x what you estimate for the year 1900!

    The exponent is going strong.
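
    For a feel of what that exponent implies, a rough sketch taking the 50GB-in-1900 and 50TB-today figures above at face value:

        import math

        growth = (50e12 / 50e9) ** (1 / 125)    # 1000x between 1900 and 2025
        print(growth - 1)                       # ~0.057, i.e. ~5.7% per year
        print(math.log(2) / math.log(growth))   # doubling time: ~12.5 years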

    Thanks both of you for giving me a better picture of it.

    --

    [0] - I entirely forgot about https://en.wikipedia.org/wiki/Royal_Society!

    30. zX41ZdbW ◴[] No.44462214[source]
    I've recently made a presentation on this topic: https://www.youtube.com/watch?v=8yH3rY1fZEA