If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.
Yet that is exactly what happens: distilled and quantized models often come very close to the original.
So I think there are still many low-hanging fruits to pick.
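To make that concrete, here's a toy sketch (mine, not any particular quantization pipeline) of naive int8 quantization in numpy: a random stand-in weight matrix shrinks to a quarter of its fp32 size, yet a matrix-vector product through it lands within a percent or two of the full-precision result.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4096, 4096)).astype(np.float32)  # stand-in for one weight matrix
    x = rng.normal(size=4096).astype(np.float32)          # stand-in for an activation vector

    # Symmetric per-tensor int8 quantization: 4 bytes per weight -> 1 byte per weight.
    scale = np.abs(W).max() / 127.0
    W_q = np.round(W / scale).astype(np.int8)
    W_hat = W_q.astype(np.float32) * scale                # dequantize for comparison

    y_full = W @ x
    y_quant = W_hat @ x
    rel_err = np.linalg.norm(y_full - y_quant) / np.linalg.norm(y_full)
    print(f"{W.nbytes / 1e6:.0f} MB -> {W_q.nbytes / 1e6:.0f} MB, relative output error ~{rel_err:.2%}")

Real schemes add per-channel scales, outlier handling, and so on, but the basic size-versus-accuracy trade is already visible at this scale.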
The question I hear you raising seems to be along the lines of: can we use a new compression method to get better resolution (reproducibility of the original) at a much smaller size?
That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.
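Back-of-the-envelope, with round numbers I'm assuming rather than quoting (a 7B-parameter model stored in fp16, trained on roughly 2T tokens at about 4 bytes of raw text per token), the "compression ratio" of that view is striking:

    params = 7e9                     # assumed: 7B-parameter model
    weights_gb = params * 2 / 1e9    # fp16 -> 2 bytes per parameter
    corpus_tb = 2e12 * 4 / 1e12      # assumed: ~2T tokens, ~4 bytes of raw text each
    print(f"weights ~{weights_gb:.0f} GB, training text ~{corpus_tb:.0f} TB")
    print(f"ratio ~{(corpus_tb * 1e12) / (weights_gb * 1e9):.0f}:1")

Several hundred to one, and obviously lossy: whatever the weights retain is a generalizing summary, not a faithful copy of the corpus.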
The Newtonian model makes provably less accurate predictions than the Einsteinian one (yes, I'm using a different example), so while it's still useful in many contexts where accuracy matters less, the number of parameters it requires doesn't much matter when you're looking for the one true GUT.
My understanding, again as a filthy computationalist, is that an accurate model of the real bonafide underlying architecture of the universe will be the simplest possible way to accurately predict anything. With the word "accurately" doing all the lifting.
As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/Nature%20of%20...
I'm sure there are decreasingly accurate, but still useful, models all the way up the computational complexity hierarchy. Lossy compression is, precisely, using one of them.
The kind of Occam's Razor-ish rule you seem to be querying about is basically a rule of thumb for selecting among formulations of equal observed predictive power that are not strictly equivalent (that is, even when they predict exactly the same actually observed phenomena, rather than different subsets of subjectively equal importance, they still differ in predictions which have not yet been testable). Newtonian and Lagrangian mechanics, by contrast, are different formulations that are strictly equivalent, which means you may choose between them for pedagogy or practical computation, but you can't choose between them for truth, because the truth of one implies the truth of the other, in either direction; they are exactly the same in substance, differing only in presentation.
(And even where it applies, it's just a rule of thumb to reject complications until they are observed to be necessary.)
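To spell out what "strictly equivalent" means in the simplest textbook case (one particle in a potential, nothing specific to this thread): take

    L(x, \dot{x}) = \tfrac{1}{2} m \dot{x}^2 - V(x),
    \qquad
    \frac{d}{dt}\frac{\partial L}{\partial \dot{x}} - \frac{\partial L}{\partial x}
      = m\ddot{x} + V'(x) = 0
      \;\Longleftrightarrow\;
      m\ddot{x} = -V'(x) = F(x).

The Euler-Lagrange equation for this L just is Newton's second law, and running the algebra backwards recovers the Euler-Lagrange equation from F = ma, so neither formulation predicts anything the other doesn't.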