mjburgess No.44442335
DeepSeek V3 is ~670B parameters, which is ~1.4TB on disk.

All digitized books ever written/encoded compress to a few TB. The public web is ~50TB. I think a usable zip of all English electronic text publicly available would be on the order of 100TB. So we're at about 1% of that in model size, and we're in a diminishing-returns area of training -- i.e., going beyond that ~1% has not yielded improvements (cf. GPT-4.5 vs GPT-4o).
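
As a rough check of that ~1% figure (a Python sketch; the 100TB corpus size is the estimate above, not a measured value):

    # Back-of-envelope: parameters -> bytes -> fraction of a ~100TB text corpus
    params = 670e9              # ~670B parameters
    bytes_per_param = 2         # assuming fp16/bf16 weights
    model_tb = params * bytes_per_param / 1e12
    corpus_tb = 100             # "usable zip of all public English text" (estimate)
    print(f"model ~{model_tb:.2f} TB, ratio ~{model_tb / corpus_tb:.1%}")
    # -> model ~1.34 TB, ratio ~1.3%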

This is why compute spend is moving to inference time with "reasoning" models. It's likely we're close to diminishing returns on inference-time compute now too, hence agents, whereby (mostly) deterministic tools supplement the system with information and capability.
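
A minimal sketch of that agent pattern -- a deterministic tool (here plain arithmetic) supplementing the model rather than more sampling; call_llm is a hypothetical stand-in for whatever completion API is in use:

    def run_agent(question: str) -> str:
        # Ask the model whether a deterministic tool is needed.
        plan = call_llm(f"If arithmetic is needed for {question!r}, "
                        f"reply 'CALC: <expression>', else answer directly.")
        if plan.startswith("CALC:"):
            expr = plan.split("CALC:", 1)[1].strip()
            result = eval(expr, {"__builtins__": {}})   # the deterministic step
            return call_llm(f"Answer {question!r} given that {expr} = {result}")
        return plan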

I think to get any more value out of this model class, we'll be looking at domain-specific specialisation beyond instruction fine-tuning.

I'd guess ~1TB of inference-time VRAM is a reasonable medium-term target for high-quality open-source models -- that's within the reach of most SMEs today, and works out to roughly 250B params.
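
One way to arrive at ~250B params in 1TB of VRAM (a sketch; the 2 bytes/param and the 50/50 split between weights and KV cache/activations are my assumptions, not OP's):

    # How many fp16 parameters fit if roughly half of 1TB goes to weights?
    vram_bytes = 1e12
    weight_fraction = 0.5       # rest reserved for KV cache, activations, overhead
    bytes_per_param = 2         # fp16/bf16
    max_params = vram_bytes * weight_fraction / bytes_per_param
    print(f"~{max_params / 1e9:.0f}B params")   # -> ~250B params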

fouc No.44443188
Maybe you're thinking of the Library of Congress when you say ~50TB? The internet is definitely larger...
Aachen No.44449735
Indeed, a quick lookup doesn't give many reliable-sounding sources, but they're all on the order of zettabytes (tens to thousands of them), even for years before any LLM was halfway usable. One has to wonder how much of that is generated, thinking of some of my own websites where the pages are derived statistics from player highscores, or the websites that jokingly index all Bitcoin addresses and UUIDs.

Perhaps the 50TB estimate is for unique information without any media or such, but OP can better back up where they got that number than I can with guesswork.