
262 points | rain1 | 3 comments
ljoshua No.44443222
Less a technical comment and more just a mind-blown comment, but I still can’t get over just how much data is compressed into and available in these downloadable models. Yesterday I was on a plane with no WiFi, but had gemma3:12b downloaded through Ollama. Was playing around with it and showing my kids, and we fired history questions at it, questions about recent video games, and some animal fact questions. It wasn’t perfect, but holy cow the breadth of information that is embedded in an 8.1 GB file is incredible! Lossy, sure, but a pretty amazing way of compressing all of human knowledge into something incredibly contained.
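Back-of-the-envelope arithmetic (my own, not from the thread): taking the comment's figures of ~12 billion parameters and an 8.1 GiB download at face value, the effective storage per weight works out to roughly 5.8 bits, which would be consistent with a 4-bit-family quantization plus format overhead (an assumption; Ollama builds vary).

```python
# Rough check: effective bits per parameter for an 8.1 GiB gemma3:12b file.
# The 12e9 parameter count and 8.1 GiB size come from the comment above;
# everything else here is illustrative arithmetic.
params = 12e9                      # ~12 billion parameters
file_bytes = 8.1 * 1024**3         # 8.1 GiB download
bits_per_param = file_bytes * 8 / params
print(round(bits_per_param, 1))    # ~5.8 bits per weight
```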
nico No.44444418
For reference (according to Google):

> The English Wikipedia, as of June 26, 2025, contains over 7 million articles and 63 million pages. The text content alone is approximately 156 GB, according to Wikipedia's statistics page. When including all revisions, the total size of the database is roughly 26 terabytes (26,455 GB)

pcrh No.44448846
Wikipedia itself describes its size as ~25 GB without media [0]. And it's probably more accurate, with broader coverage across multiple languages, than the LLM downloaded by the GP.

[0] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

1. pessimizer No.44449192
Really? I'd assume that an LLM would deduplicate Wikipedia into something much smaller than 25GB. That's its only job.
2. crazygringo No.44449683
> That's its only job.

The vast, vast majority of LLM knowledge is not found in Wikipedia. It is definitely not its only job.

3. Tostino No.44449902
When trained on next-word prediction with the standard loss function, that is, by definition, its only job.
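A minimal sketch of the "standard loss function" being referenced, i.e. token-level cross-entropy (negative log-likelihood of the actual next token), in plain Python with a toy vocabulary; the function name and numbers are illustrative, not from the thread:

```python
import math

def next_token_loss(probs, target_index):
    """Cross-entropy loss for one next-token prediction: the negative
    log of the probability the model assigned to the token that
    actually came next in the training text."""
    return -math.log(probs[target_index])

# Toy 3-token vocabulary; the model puts 70% of its mass on the
# correct next token, so the loss is -ln(0.7).
probs = [0.1, 0.7, 0.2]
loss = next_token_loss(probs, target_index=1)
print(round(loss, 3))  # 0.357
```

Minimizing this loss rewards reproducing the training distribution, which is why "compressing the corpus" and "predicting the next word" end up being two views of the same objective.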