I wish people would stop parroting the view that LLMs are lossy compression.
There is a vague sense in which the metaphor holds, but a much more interesting and rigorous fact about LLMs is that they are also _lossless_ compression algorithms.
There are at least two senses in which this is true:
1. You can use an LLM to losslessly compress any piece of text at a cost that approaches the negative log-likelihood of that text under the model (i.e. the sum of -log2 of each token's probability), using arithmetic coding. Sender and receiver both need a copy of the LLM weights. (Rough sketch below.)
2. You can use an LLM plus SGD (i.e. the training code) as a lossless compression algorithm, where the communication cost is the area under the training-loss curve, and the model weights don't count towards the description length! See Jack Rae, "Compression for AGI".
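To make (1) concrete, here's a rough sketch of the accounting (my own toy code, not from Rae's talk). An ideal arithmetic coder gets within a couple of bits of the sum of -log2 p(token | prefix), so you can estimate the compressed size without building the coder itself. `predict_probs` below is just a smoothed character-count model standing in for the LLM; with a real model you'd plug in its next-token distribution instead:

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def predict_probs(prefix: str) -> dict:
    # Toy stand-in for the LLM's next-token distribution: Laplace-smoothed
    # character counts over the prefix. A real LLM conditions far more richly,
    # but the interface (prefix -> distribution over next symbols) is the same.
    counts = Counter(prefix)
    total = len(prefix) + len(ALPHABET)
    return {c: (counts[c] + 1) / total for c in ALPHABET}

def code_length_bits(text: str) -> float:
    # An ideal arithmetic coder gets within ~2 bits of this total:
    # the sum of -log2 p(next symbol | prefix) under the model.
    bits = 0.0
    for i, ch in enumerate(text):
        bits += -math.log2(predict_probs(text[:i])[ch])
    return bits

if __name__ == "__main__":
    msg = "the cat sat on the mat and the cat sat on the hat"
    baseline = len(msg) * math.log2(len(ALPHABET))  # fixed-rate encoding
    print(f"model-based: {code_length_bits(msg):.1f} bits, "
          f"uniform baseline: {baseline:.1f} bits")
```

The same snippet hints at why (2) works: the stand-in model is updated deterministically from symbols the decoder has already recovered, so the decoder can mirror every update and the "weights" never need to be sent. Swap the count updates for SGD steps and you get the picture in (2), sometimes called prequential or online coding: the total cost paid is the area under the training-loss curve, and the final weights come for free because the receiver can re-run the same deterministic training.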