How large are large language models?

(gist.github.com)

ljoshua ◴[02 Jul 25 13:00 UTC] No.44443222[source]▶

>>44442072 (OP) #

Less a technical comment and more just a mind-blown comment, but I still can’t get over just how much data is compressed into and available in these downloadable models. Yesterday I was on a plane with no WiFi, but had gemma3:12b downloaded through Ollama. Was playing around with it and showing my kids, and we fired history questions at it, questions about recent video games, and some animal fact questions. It wasn’t perfect, but holy cow the breadth of information that is embedded in an 8.1 GB file is incredible! Lossy, sure, but a pretty amazing way of compressing all of human knowledge into something incredibly contained.
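
As a rough back-of-the-envelope check on those numbers, the sketch below works out how many bits per parameter an 8.1 GB file leaves for a model tagged "12b". It assumes exactly 12 billion parameters, decimal gigabytes (1e9 bytes), and a file that holds nothing but weights; `bits_per_parameter` is a hypothetical helper, not part of Ollama.

```python
def bits_per_parameter(file_size_gb: float, params_billions: float) -> float:
    """Average bits stored per parameter, using decimal GB (1e9 bytes)."""
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (params_billions * 1e9)


# Figures from the comment: an 8.1 GB download for a model tagged "12b".
# Assumption: exactly 12e9 parameters and no non-weight data in the file.
bpp = bits_per_parameter(8.1, 12)
print(f"{bpp:.2f} bits per parameter")
```

Under those assumptions the result is about 5.4 bits per parameter, which is consistent with weights stored in a quantized format rather than as 16-bit floats.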

replies(22): >>44443263 #>>44443274 #>>44443296 #>>44443751 #>>44443781 #>>44443840 #>>44443976 #>>44444227 #>>44444418 #>>44444471 #>>44445299 #>>44445966 #>>44446013 #>>44446775 #>>44447373 #>>44448218 #>>44448315 #>>44448452 #>>44448810 #>>44449169 #>>44449182 #>>44449585 #

1. rain1 ◴[02 Jul 25 13:06 UTC] No.44443274[source]▶

>>44443222 #

It's extremely interesting how powerful a language model is at compression.

When you train it to be an assistant model, it's better at compressing assistant transcripts than it is at compressing general text.

There is an eval I have a lot of interest in and respect for, UncheatableEval (https://huggingface.co/spaces/Jellyfish042/UncheatableEval), which tests how good a language model an LLM is by applying it to a range of compression tasks.

This task is essentially impossible to 'cheat'. Compression is a benchmark you cannot game!
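
The link between prediction and compression can be sketched in a few lines. A model that assigns probability p to the next symbol can, with arithmetic coding, encode that symbol in roughly -log2(p) bits, so summing that quantity over a file gives the compressed size the model would achieve. The toy below is not UncheatableEval's code and uses no LLM: it assumes a simple adaptive byte-count model with add-one smoothing, and `code_length_bits` and the sample text are hypothetical names chosen for illustration.

```python
import math
from collections import defaultdict


def code_length_bits(data: bytes, order: int) -> float:
    """Ideal coded size of `data`, in bits, under an adaptive byte model.

    Each byte is predicted from the previous `order` bytes using the counts
    seen so far, with add-one smoothing over the 256 possible byte values.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits = 0.0
    for i, byte in enumerate(data):
        ctx = data[max(0, i - order):i]  # up to `order` preceding bytes
        p = (counts[ctx][byte] + 1) / (totals[ctx] + 256)
        bits += -math.log2(p)            # cost of coding this byte
        counts[ctx][byte] += 1           # update the model after coding
        totals[ctx] += 1
    return bits


# Toy input (an assumption for illustration): a highly repetitive phrase.
sample = b"the cat sat on the mat. " * 1000
for order in (0, 2):
    bits = code_length_bits(sample, order)
    print(f"order {order}: {bits / len(sample):.3f} bits per byte")
```

The order-2 model predicts each byte better than the order-0 model, so its total code length is smaller. Swapping in a stronger predictor, such as an LLM's next-token probabilities, is the same idea at a much larger scale.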

replies(2): >>44443457 #>>44444415 #

2. MPSimmons ◴[02 Jul 25 13:27 UTC] No.44443457[source]▶

>>44443274 (TP) #

Agreed. It's basically lossy compression for everything it's ever read. And the quantization impacts the lossiness, but since a lot of text is super fluffy, we tend not to notice as much as we would when we, say, listen to music that has been compressed in a lossy way.
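
To make the quantization point concrete, here is a minimal sketch of how fewer bits per weight means more loss. It assumes plain symmetric uniform quantization over made-up Gaussian "weights"; the quantization schemes used for real downloadable models are more elaborate, and `quantize_roundtrip` and `rms_error` are hypothetical helpers.

```python
import math
import random


def quantize_roundtrip(weights, bits):
    """Quantize to signed integers of the given width, then map back to floats."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / levels              # width of one quantization step
    return [round(w / scale) * scale for w in weights]


def rms_error(original, restored):
    """Root-mean-square difference between two equal-length sequences."""
    total = sum((a - b) ** 2 for a, b in zip(original, restored))
    return math.sqrt(total / len(original))


random.seed(0)
# Assumption: synthetic weights drawn from a narrow Gaussian, not a real model.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]
for bits in (8, 4, 2):
    err = rms_error(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit: RMS error {err:.6f}")
```

The error grows as the bit width shrinks, which is the "lossiness" dial the comment describes.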

replies(2): >>44446147 #>>44450147 #

3. soulofmischief ◴[02 Jul 25 14:45 UTC] No.44444415[source]▶

>>44443274 (TP) #

Knowledge is learning relationships by decontextualizing information into generalized components. Application of knowledge is recontextualizing these components based on the problem at hand.

This is essentially just compression and decompression. It's just that with prior compression techniques, we never tried leveraging the inherent relationships encoded in a compressed data structure, because our compression schemes did not leverage semantic information in a generalized way and thus did not encode very meaningful relationships other than "this data uses the letter 'e' quite a lot".
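
A small experiment illustrates how little a frequency-only scheme "knows". The sketch below assumes a sample built from the two paragraphs above and compares it with a shuffled copy of the same bytes. A pure letter-frequency model gives both the same cost, because it cannot see order at all, while zlib, which exploits repeated substrings, should do noticeably better on the original. `order0_entropy_bits` is a hypothetical helper.

```python
import math
import random
import zlib
from collections import Counter


def order0_entropy_bits(data: bytes) -> float:
    """Total Shannon entropy of the byte frequencies, in bits."""
    counts = Counter(data)
    n = len(data)
    return -sum(c * math.log2(c / n) for c in counts.values())


text = (
    b"Knowledge is learning relationships by decontextualizing information "
    b"into generalized components. Application of knowledge is "
    b"recontextualizing these components based on the problem at hand. "
    b"This is essentially just compression and decompression. It's just that "
    b"with prior compression techniques, we never tried leveraging the "
    b"inherent relationships encoded in a compressed data structure, because "
    b"our compression schemes did not leverage semantic information in a "
    b"generalized way and thus did not encode very meaningful relationships."
)

# Same bytes, same frequencies, but all ordering destroyed.
scrambled = bytearray(text)
random.Random(0).shuffle(scrambled)
shuffled = bytes(scrambled)

for name, data in (("original", text), ("shuffled", shuffled)):
    freq_bytes = order0_entropy_bits(data) / 8
    print(f"{name}: frequency-only ~{freq_bytes:.0f} bytes, "
          f"zlib {len(zlib.compress(data))} bytes")
```

Even zlib only captures literal repeats. It has no notion that two differently worded sentences mean the same thing, which is the gap the comment is pointing at.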

A lot of that comes from the sheer amount of data we throw at these models, which provides enough substrate for semantic compression. Compare that to common compression schemes in the wild, where data is compressed in isolation without contributing its information to some model of the world. It turns out that because of this, we've been leaving a lot on the table with regard to compression. Another factor has been the speed/efficiency tradeoff. GPUs have allowed us to put a lot more into efficiency, and the expectation that many language models only need to produce text as fast as a human can read it means that we can optimize even further for efficiency over speed.

Also, shout out to Fabrice Bellard's ts_zip, which leverages LLMs to compress text files. https://bellard.org/ts_zip/

replies(1): >>44463808 #

4. entropicdrifter ◴[02 Jul 25 17:05 UTC] No.44446147[source]▶

>>44443457 #

It's a bit like if you trained a virtual band to play any song ever, then told it to do its own version of the songs. Then prompted it to play whatever specific thing you wanted. It won't be the same because it kinda remembers the right thing sorta, but it's also winging it.

5. arcticbull ◴[03 Jul 25 00:00 UTC] No.44450147[source]▶

>>44443457 #

I've been referring to LLMs as JPEG for all the world's data, and people have really started to come around to it. Initially most folks tended to outright reject this comparison.

replies(1): >>44450341 #

6. simonw ◴[03 Jul 25 00:31 UTC] No.44450341{3}[source]▶

>>44450147 #

Ted Chiang wrote a great piece about that: https://www.newyorker.com/tech/annals-of-technology/chatgpt-...

I think it's a solid description for a raw model, but it's less applicable once you start combining an LLM with better context and tools.

What's interesting to me isn't the stuff the LLM "knows" - it's how well an LLM system can serve me when combined with RAG and tools like web search and access to a compiler.

The most interesting developments right now are models like Gemma 3n which are designed to have as much capability as possible without needing a huge amount of "facts" baked into them.

7. TeMPOraL ◴[04 Jul 25 12:07 UTC] No.44463808[source]▶

>>44444415 #

And of course, once we extended lossy compression to make use of the semantic space, we started getting compression artifacts in semantic space - aka "hallucinations".

replies(1): >>44466750 #

8. soulofmischief ◴[04 Jul 25 18:18 UTC] No.44466750{3}[source]▶

>>44463808 #

That seems worthy of a blog post!

replies(1): >>44471842 #

9. TeMPOraL ◴[05 Jul 25 11:04 UTC] No.44471842{4}[source]▶

>>44466750 #

I don't know, it's not that profound of an insight. You throw away color information, the image gets blocky. You throw away frequency information, the image gets blurry. You throw away semantic information, shit stops making sense :).

Still, if someone would turn that into a blog post, I'd happily read it.
