
262 points by rain1 | 1 comment
mjburgess
DeepSeek V3 is ~671bn parameters, which is ~1.4TB physical at FP16 (2 bytes per parameter).
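
A rough sanity check of that figure (a minimal sketch; the 2-bytes-per-parameter FP16/BF16 assumption is mine, implied but not stated above):

    # ~671B params at an assumed 2 bytes/param (FP16/BF16 weights)
    params = 671e9
    bytes_per_param = 2
    print(f"{params * bytes_per_param / 1e12:.2f} TB")  # -> 1.34 TB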

All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all English electronic text publicly available would be on the order of 100TB. So we're at about 1% of that in model size, and we're in a diminishing-returns area of training -- i.e., going to >1% has not yielded improvements (cf. GPT-4.5 vs 4o).
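
As a sketch of the ~1% claim, plugging in the estimates above (both numbers are the paragraph's estimates, not measurements):

    model_tb = 1.34    # DeepSeek-V3 at assumed FP16, from above
    corpus_tb = 100    # "usable zip of all English electronic text"
    print(f"{model_tb / corpus_tb:.1%}")  # -> 1.3%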

This is why compute spend is moving to inference time with "reasoning" models. It's likely we're close to diminishing returns on inference-time compute now too, hence agents, whereby (mostly) deterministic tools supplement information/capability into the system.

I think to get any more value out of this model class, we'll be looking at domain-specific specialisation beyond instruction fine-tuning.

I'd guess targeting 1TB of inference-time VRAM would be a reasonable medium-term target for high-quality open-source models -- that's within the reach of most SMEs today. That's about 250bn params.
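
A sketch of that parameter budget at a few common precisions (weights only; real deployments also need room for KV cache and activations, which is presumably why the target is ~250bn rather than ~500bn):

    # Parameters that fit in a 1 TB weight budget at various precisions
    budget_bytes = 1e12
    for name, bpp in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
        print(f"{name}: ~{budget_bytes / bpp / 1e9:.0f}B params")
    # FP32: ~250B, FP16/BF16: ~500B, INT8: ~1000B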

account-5
> All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all English electronic text publicly available would be on the order of 100TB.

Where are you getting these numbers from? I'm interested to see how they're calculated.

I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx. 50MB. (I might be misquoting, since I no longer have the source.)

TeMPOraL
> I read somewhere, but cannot find the source anymore, that all written text prior to this century was approx. 50MB. (I might be misquoting, since I no longer have the source.)

50 MB feels too low, unless the quote meant text up until the 20th century, in which case it feels much more believable. In terms of text production and publishing, we're still riding an exponential, so a couple orders of magnitude of increase between 1899 and 2025 is not surprising.
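
To illustrate with an assumed, not sourced, growth rate: even a modest sustained exponential compounds to roughly two orders of magnitude over that span:

    # 4%/year is a made-up illustrative rate, not a measurement
    annual_growth = 0.04
    years = 2025 - 1899
    print(f"~{(1 + annual_growth) ** years:.0f}x over {years} years")  # -> ~140x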

(Talking about S-curves is all the rage these days, but I feel it's usually a way to avoid understanding what exponential growth means - if one assumes we're past the inflection point, one can wave one's hands, pretend the change is linear, and continue to not understand it.)

jerf
50MB feels like "all the 'ancient' text we have", maybe, as measured by the size of the original content and not counting copies. A quick check of Alice in Wonderland puts it at 163kB in plain text; about 300 of those get us to 50MB. There are way more than 300 books of similar size from the 19th century alone. They may not all be digitized and freely available, but you can fill libraries with surviving 19th-century texts, let alone what may be lost by now.
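
The arithmetic behind that estimate:

    # How many Alice-sized plain-text books fit in 50 MB?
    alice_bytes = 163_000       # Alice in Wonderland, plain text
    quote_bytes = 50_000_000    # the quoted 50 MB figure
    print(quote_bytes // alice_bytes)  # -> 306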

Or it may just be someone bloviating and being wrong... I think even ancient texts alone could exceed that number, though perhaps not by an order of magnitude.