When you train it to be an assistant model, it's better at compressing assistant transcripts than it is general text.
There is an eval which I have a lot of interested in and respect for https://huggingface.co/spaces/Jellyfish042/UncheatableEval called UncheatableEval, which tests how good of a language model an LLM is by applying it on a range of compression tasks.
This task is essentially impossible to 'cheat'. Compression is a benchmark you cannot game!
I think it's a solid description for a raw model, but it's less applicable once you start combining an LLM with better context and tools.
What's interesting to me isn't the stuff the LLM "knows" - it's how well an LLM system can serve me when combined with RAG and tools like web search and access to a compiler.
The most interesting developments right now are models like Gemma 3n which are designed to have as much capability as possible without needing a huge amount of "facts" baked into them.