Did that metric meaningfully change when the amount of required memory dropped?
If the diversity is lowered, I would expect that to reduce the number of patterns to be modeled from the text. If that is the case, then the resulting model size itself would shrink, both during and after training.
H_s := -\sum_{x \in X_s} p(x) \log p(x)
where X_s := all s-grams from the training set? That seems like it would eventually become hard or even impossible to actually compute. And even if you could, what would it tell you? Or, wait... are you referring to running such an analysis on the output of the model? Yeah, that might prove interesting...
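For small s it is actually tractable, at least on a sample of the text. Here is a minimal sketch of what computing H_s over character s-grams could look like; the helper name `sgram_entropy` is my own, not anything from the thread:

```python
from collections import Counter
from math import log2

def sgram_entropy(text: str, s: int) -> float:
    """Shannon entropy (in bits) of the empirical s-gram distribution of `text`."""
    # Slide a window of width s over the text to collect all s-grams.
    grams = [text[i:i + s] for i in range(len(text) - s + 1)]
    counts = Counter(grams)
    total = len(grams)
    # H_s = -sum over distinct s-grams x of p(x) * log2 p(x)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(sgram_entropy("the cat sat on the mat", 2))
```

The blow-up you are pointing at is real, though: the number of distinct s-grams can grow exponentially with s, so for large s the counts become too sparse to estimate p(x) reliably.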
Because the text we write is not uniformly distributed random noise, what we encode into it (by writing) carries entropy.
Because LLMs model text via inference, they end up modeling all of the entropy that is present in it.
That would mean the resulting size is a measure of entropy (the sum of patterns) divided by repetition (recurring patterns). In this count I would treat each unique token on its own as an instance of the identity pattern.
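Compression makes that intuition concrete, as a loose analogy: a compressor's output size also shrinks with repetition and grows with pattern diversity. A quick sketch with zlib (my own illustration, not something from the thread):

```python
import os
import zlib

# Same input length, very different pattern diversity.
repetitive = b"the cat sat on the mat. " * 100  # many recurring patterns
diverse = os.urandom(2400)                      # near-maximal entropy, few patterns

# The repetitive text compresses far smaller than the random bytes.
print(len(zlib.compress(repetitive)), len(zlib.compress(diverse)))
```

If model size behaves at all like compressed size, lowering the diversity of the training text should shrink it, which is the expectation above.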
So to answer both questions: yes.