How large are large language models?

(gist.github.com)

262 points rain1 | 2 comments | 02 Jul 25 10:39 UTC


ljoshua ◴[02 Jul 25 13:00 UTC] No.44443222[source]▶

>>44442072 (OP) #

Less a technical comment and more just a mind-blown comment, but I still can’t get over just how much data is compressed into and available in these downloadable models. Yesterday I was on a plane with no WiFi, but had gemma3:12b downloaded through Ollama. Was playing around with it and showing my kids, and we fired history questions at it, questions about recent video games, and some animal fact questions. It wasn’t perfect, but holy cow the breadth of information that is embedded in an 8.1 GB file is incredible! Lossy, sure, but a pretty amazing way of compressing all of human knowledge into something incredibly contained.
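For a sense of scale, here is a back-of-envelope calculation of how many bits each parameter gets in a file that size. It assumes "12b" means roughly 12 billion parameters and that GB means 10**9 bytes; both are assumptions, and the file also holds some non-weight data, so treat the result as approximate.

```python
# Back-of-envelope: bits per parameter implied by the download size.
# Assumptions (not from the thread): ~12e9 parameters, 1 GB = 10**9 bytes.
FILE_BYTES = 8.1e9
PARAMS = 12e9


def bits_per_parameter(file_bytes: float, params: float) -> float:
    """Average number of bits of file per model parameter."""
    return file_bytes * 8 / params


if __name__ == "__main__":
    print(f"{bits_per_parameter(FILE_BYTES, PARAMS):.1f} bits per parameter")
```

Under those assumptions the figure comes out a little over 5 bits per parameter, well below the 16 bits per weight of a half-precision checkpoint, which suggests the download is a quantized build.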

replies(22): >>44443263 #>>44443274 #>>44443296 #>>44443751 #>>44443781 #>>44443840 #>>44443976 #>>44444227 #>>44444418 #>>44444471 #>>44445299 #>>44445966 #>>44446013 #>>44446775 #>>44447373 #>>44448218 #>>44448315 #>>44448452 #>>44448810 #>>44449169 #>>44449182 #>>44449585 #

rain1 ◴[02 Jul 25 13:06 UTC] No.44443274[source]▶

>>44443222 #

It's extremely interesting how powerful a language model is at compression.

When you train it to be an assistant model, it's better at compressing assistant transcripts than it is at compressing general text.

There is an eval I have a lot of interest in and respect for, UncheatableEval (https://huggingface.co/spaces/Jellyfish042/UncheatableEval), which tests how good a language model an LLM is by applying it to a range of compression tasks.

This task is essentially impossible to 'cheat'. Compression is a benchmark you cannot game!
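A rough sketch of why prediction and compression are the same problem: if a model assigns probability p to the symbol that actually comes next, an arithmetic coder can store that symbol in about -log2(p) bits, so the total code length is just the model's cumulative surprise. The toy below uses a small adaptive byte-level context model, not an LLM, and it is not UncheatableEval's actual procedure; `code_length_bits`, `order`, and `alpha` are names made up for this illustration.

```python
import math
import zlib
from collections import defaultdict


def code_length_bits(data: bytes, order: int = 2, alpha: float = 0.1) -> float:
    """Ideal code length of `data` under an adaptive order-`order` byte model.

    Each byte costs -log2 p(byte | previous `order` bytes). Counts are updated
    only after a byte has been "coded", so a decoder could rebuild the same
    model and stay in sync.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> byte -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, sym in enumerate(data):
        ctx = data[max(0, i - order):i]
        # Additive smoothing over the 256 possible byte values.
        p = (counts[ctx][sym] + alpha) / (totals[ctx] + alpha * 256)
        bits += -math.log2(p)
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits


if __name__ == "__main__":
    sample = ("the cat sat on the mat and the dog sat on the log. " * 40).encode()
    print("raw bytes:        ", len(sample))
    print("zlib bytes:       ", len(zlib.compress(sample, 9)))
    print("model bytes (est):", math.ceil(code_length_bits(sample) / 8))
```

The better the model predicts unseen text, the smaller the number, which is why a score computed on fresh data is hard to game: memorizing a test set does not help with text the model has never seen.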

replies(2): >>44443457 #>>44444415 #

soulofmischief ◴[02 Jul 25 14:45 UTC] No.44444415[source]▶

>>44443274 #

Knowledge is learning relationships by decontextualizing information into generalized components. Application of knowledge is recontextualizing these components based on the problem at hand.

This is essentially just compression and decompression. It's just that with prior compression techniques, we never tried leveraging the inherent relationships encoded in a compressed data structure, because our compression schemes did not leverage semantic information in a generalized way and thus did not encode very meaningful relationships other than "this data uses the letter 'e' quite a lot".

A lot of that comes from the sheer amount of data we throw at these models, which provides enough substrate for semantic compression. Compare that to common compression schemes in the wild, where data is compressed in isolation without contributing its information to some model of the world. It turns out that because of this, we've been leaving a lot on the table with regard to compression. Another factor has been the speed/efficiency tradeoff. GPUs have allowed us to put a lot more into efficiency, and the expectation that many language models only need to produce text as fast as a human can read it means that we can optimize even further for efficiency over speed.
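A small way to see the "compressed in isolation" point with an ordinary compressor: zlib lets you supply a preset dictionary, a chunk of bytes both sides already share, and matches against it cost almost nothing. This is only a loose stand-in for a learned model of the world (it matches byte strings and knows nothing about semantics), and the helper names and sample strings are invented for the sketch.

```python
import zlib


def compress_isolated(message: bytes) -> bytes:
    """Compress the message with no knowledge beyond the message itself."""
    return zlib.compress(message, 9)


def compress_with_prior(message: bytes, prior: bytes) -> bytes:
    """Compress using `prior` as a preset dictionary shared with the decoder."""
    comp = zlib.compressobj(level=9, zdict=prior)
    return comp.compress(message) + comp.flush()


def decompress_with_prior(blob: bytes, prior: bytes) -> bytes:
    """Invert compress_with_prior; requires exactly the same `prior`."""
    decomp = zlib.decompressobj(zdict=prior)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    prior = b"The quick brown fox jumps over the lazy dog. "
    message = b"The quick brown fox jumps over the lazy dog again."
    print("isolated:  ", len(compress_isolated(message)), "bytes")
    print("with prior:", len(compress_with_prior(message, prior)), "bytes")
```

The shared prior is not free, since the decoder has to have it too. That is the same trade a language-model compressor makes: a multi-gigabyte model on both ends in exchange for very short encodings of anything the model finds predictable.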

Also, shout out to Fabrice Bellard's ts_zip, which leverages LLMs to compress text files. https://bellard.org/ts_zip/

replies(1): >>44463808 #

TeMPOraL ◴[04 Jul 25 12:07 UTC] No.44463808[source]▶

>>44444415 #

And of course, once we extended lossy compression to make use of the semantic space, we started getting compression artifacts in semantic space - aka "hallucinations".

replies(1): >>44466750 #

soulofmischief ◴[04 Jul 25 18:18 UTC] No.44466750[source]▶

>>44463808 #

That seems worthy of a blog post!

replies(1): >>44471842 #

TeMPOraL ◴[05 Jul 25 11:04 UTC] No.44471842[source]▶

>>44466750 #

I don't know, it's not that profound of an insight. You throw away color information, the image gets blocky. You throw away frequency information, the image gets blurry. You throw away semantic information, shit stops making sense :).

Still, if someone would turn that into a blog post, I'd happily read it.

replies(1): >>44474931 #

soulofmischief ◴[05 Jul 25 19:28 UTC] No.44474931[source]▶

>>44471842 (TP) #

There's more to it than that. You can draw strong analogies and also discuss where the analogy breaks down. For example, you can compare degraded recall of specific information to high-frequency attenuation in lossy codecs.
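To make the codec half of that analogy concrete, here is a toy low-pass filter: take a signal made of one broad wave plus one sharp spike, discard the high-frequency bins of its discrete Fourier transform, and reconstruct. The broad shape survives nearly intact while the spike, the "specific detail", is mostly lost. It illustrates the signal-processing side only and makes no claim about how a language model stores facts; the signal and the cutoff are arbitrary choices for the sketch.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft; returns the real part of each reconstructed sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def low_pass(x, keep):
    """Zero every frequency bin above `keep` (and its mirror), then invert."""
    n = len(x)
    X = dft(x)
    filtered = [X[k] if min(k, n - k) <= keep else 0 for k in range(n)]
    return idft(filtered)


def make_signal(n=64, spike_at=20, spike_height=5.0):
    """One slow sine cycle (the gist) plus a single-sample spike (a detail)."""
    x = [math.sin(2 * math.pi * t / n) for t in range(n)]
    x[spike_at] += spike_height
    return x


if __name__ == "__main__":
    x = make_signal()
    y = low_pass(x, keep=4)
    print(f"spike sample:  original {x[20]:.2f}, after low-pass {y[20]:.2f}")
    print(f"trough sample: original {x[48]:.2f}, after low-pass {y[48]:.2f}")
```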
