If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.
Yet this is what happens - the distilled or quantized models often come very close to the original model.
So I think there is still a lot of low-hanging fruit to pick.
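As a rough illustration of the quantization point above, here is a minimal sketch that rounds a list of floating-point "weights" to 8-bit integer codes and measures how much is lost. The function names and the single shared scale are assumptions chosen for illustration; this is not how any particular library quantizes a model.

```python
import random

def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale for c in codes]

random.seed(0)
# Stand-in for model weights: 1000 samples from a unit Gaussian.
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-weight error by scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f}  max_error={max_error:.5f}")
```

Each weight now needs 8 bits instead of 32 or 64, and the reconstruction error per weight is at most half a quantization step. Whether that error matters for model quality is an empirical question the sketch does not answer.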
The question I hear you raising seems to be along the lines of: can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size?
That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.
(discussed here: https://news.ycombinator.com/item?id=34724477 )
To reiterate: we can lose a lot of data (have incomplete data) and still have a perfectly visible JPEG (or MP3, same thing).
The Newtonian model makes provably less accurate predictions than Einsteinian (yes, I'm using a different example), so while still useful in many contexts where accuracy is less important, the number of parameters it requires doesn't much matter when looking for the one true GUT.
My understanding, again as a filthy computationalist, is that an accurate model of the real bona fide underlying architecture of the universe will be the simplest possible way to accurately predict anything. With the word "accurately" doing all the lifting.
As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/Nature%20of%20...
I'm sure there are decreasingly accurate, but still useful, models all the way up the computational complexity hierarchy. Lossy compression is, precisely, using one of them.
So it's lossy all the way down with LLMs, too.
Reality > Data created by a human > LLM > Distilled LLM
Not my question to answer, I think that lies in philosophical questions about what is a "law".
I see useful abstractions all the way down. The linked Asimov essay covers this nicely.
The kind of Occam’s Razor-ish rule you seem to be asking about is basically a rule of thumb for selecting among formulations of equal observed predictive power that are not strictly equivalent (that is, even if they predict exactly the same actually observed phenomena, rather than different subsets of subjectively equal importance, they still differ in predictions which have not yet been testable). Newtonian and Lagrangian mechanics, by contrast, are different formulations that are strictly equivalent, which means you may choose between them for pedagogy or practical computation, but you can't choose between them for truth, because the truth of one implies the truth of the other, in either direction; they are exactly the same in substance, differing only in presentation.
(And even where it applies, it's just a rule of thumb to reject complications until they are observed to be necessary.)
Here is something that Newtonian mechanics and Lagrangian mechanics have in common: it is necessary to specify whether the context is Minkowski spacetime, or Galilean spacetime.
Before the introduction of relativistic physics the assumption that space is Euclidean was granted by everybody. The transition from Newtonian mechanics to relativistic mechanics was a shift from one metric of spacetime to another.
In retrospect we can recognize Newton's first law as asserting a metric: an object in inertial motion will in equal intervals of time traverse equal distances of space.
We can choose to make the assertion of a metric of spacetime a very wide assertion, such as: position vectors, velocity vectors and acceleration vectors add according to the metric of the spacetime.
Then to formulate Newtonian mechanics these two principles are sufficient: The metric of the spacetime, and Newton's second law.
Hamilton's stationary action is the counterpart of Newton's second law. Just as in the case of Newtonian mechanics: in order to express a theory of motion you have to specify a metric; Galilean metric or Minkowski metric.
To formulate Lagrangian mechanics: choosing stationary action as foundation is in itself not sufficient; you have to specify a metric.
So: Lagrangian mechanics is not sparser; it is on par with Newtonian mechanics.
More generally: transformation between Newtonian mechanics and Lagrangian mechanics is bi-directional.
Shifting between Newtonian formulation and Lagrangian formulation is similar to shifting from Cartesian coordinates to polar coordinates. Depending on the nature of the problem one formulation or the other may be more efficient, but it's the same physics.
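To make the "same physics" point concrete, here is a minimal worked case, assuming a single particle of constant mass m moving in one dimension in a potential U(x) (symbols chosen here for illustration). Applying the Euler-Lagrange equation to L = kinetic minus potential energy gives back F=ma:

```latex
L(x, \dot{x}) = \tfrac{1}{2} m \dot{x}^2 - U(x)

\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}}\right) - \frac{\partial L}{\partial x} = 0
\quad\Longrightarrow\quad
m \ddot{x} + \frac{dU}{dx} = 0
\quad\Longrightarrow\quad
m \ddot{x} = -\frac{dU}{dx} = F
```

Read right to left, the same steps take you from F=ma back to the Euler-Lagrange equation for this Lagrangian, which is the bi-directionality in this simple case.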
There's also more than one way to think about complexity. Newtonian mechanics in practice requires introducing forces everywhere, especially for more complex systems, to the point that it can feel a bit ad hoc. Lagrangian mechanics very often requires fewer such introductions and often results in descriptions with fewer equations and fewer terms. If you can explain the same phenomenon with fewer 'entities', then to me it feels very much like Occam's razor would favor that explanation.
In terms of Newtonian mechanics the members of the equivalence class of inertial coordinate systems are related by Galilean transformation.
In terms of relativistic mechanics the members of the equivalence class of inertial coordinate systems are related by Lorentz transformation.
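For reference, the two transformations can be written out as follows, assuming the standard configuration: the primed frame moves with velocity v along the shared x-axis and the origins coincide at t = 0 (c is the speed of light).

```latex
\text{Galilean:}\qquad x' = x - v t, \qquad t' = t

\text{Lorentz:}\qquad x' = \gamma\,(x - v t), \qquad
t' = \gamma\left(t - \frac{v x}{c^2}\right), \qquad
\gamma = \frac{1}{\sqrt{1 - v^2/c^2}}
```

For v much smaller than c, gamma approaches 1 and the v x / c^2 term becomes negligible, so the Lorentz transformation reduces to the Galilean one.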
Newton's first law and Newton's third law can be grouped together in a single principle: the Principle of uniformity of Inertia. Inertia is uniform everywhere, in every direction.
That is why I argue that for Newtonian mechanics two principles are sufficient.
The Newtonian formulation is in terms of F=ma; the Lagrangian formulation is in terms of interconversion between potential energy and kinetic energy.
The work-energy theorem expresses the transformation between F=ma and potential/kinetic energy. I derive the work-energy theorem in an answer of mine on physics.stackexchange: https://physics.stackexchange.com/a/788108/17198
The work-energy theorem is the most important theorem of classical mechanics.
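A compact version of the theorem, as a sketch for motion along one dimension with constant mass (this is the standard textbook route and not necessarily the derivation in the linked answer). Here s is position, v is velocity, and the subscript 0 marks initial values:

```latex
\int_{s_0}^{s} F \, ds
= \int_{s_0}^{s} m a \, ds
= \int_{t_0}^{t} m \frac{dv}{dt} \, v \, dt
= \int_{v_0}^{v} m v \, dv
= \tfrac{1}{2} m v^2 - \tfrac{1}{2} m v_0^2
```

If the force derives from a potential, F = -dU/ds, the left-hand side equals -(U(s) - U(s_0)), so the change in kinetic energy equals minus the change in potential energy.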
About the type of situation where the energy formulation of mechanics is more suitable: when there are multiple degrees of freedom, the force and the acceleration of F=ma are vectorial. So F=ma has the property that there are vector quantities on both sides of the equation.
When expressing motion in terms of energy: as we know, the value of kinetic energy is a single value; there is no directional information. In the process of squaring the velocity vector the directional information is discarded; it is lost.
The reason we can afford to lose the directional information of the velocity vector: the description of the potential energy still carries the necessary directional information.
When there are, say, two degrees of freedom the function that describes the potential must be given as a function of two (generalized) coordinates.
This comprehensive function for the potential energy allows us to recover the force vector. To recover the force vector we evaluate the gradient of the potential energy function (the force is minus that gradient).
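A small numerical sketch of that recovery, assuming a two-dimensional harmonic potential U(x, y) = k(x^2 + y^2)/2 picked purely as an example. The function names and the finite-difference step are illustrative choices. The force is taken as minus the gradient of the potential:

```python
def potential(x, y, k=2.0):
    """Example scalar potential: a 2D harmonic well (an assumed test case)."""
    return 0.5 * k * (x * x + y * y)

def force_from_potential(U, x, y, h=1e-6):
    """Recover the force vector as minus the gradient of U.

    The partial derivatives are estimated with central differences.
    """
    dU_dx = (U(x + h, y) - U(x - h, y)) / (2 * h)
    dU_dy = (U(x, y + h) - U(x, y - h)) / (2 * h)
    return (-dU_dx, -dU_dy)

fx, fy = force_from_potential(potential, 1.0, -0.5)
# The analytic force for this potential is (-k*x, -k*y).
print(fx, fy)
```

The potential itself returns a single number at each point, yet sampling it in the neighborhood of a point is enough to reconstruct both components of the force there.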
The function that describes the potential is not itself a vector quantity, but it does carry all of the directional information that allows us to recover the force vector.
I will argue the power of the Lagrangian formulation of mechanics is as follows: when the motion is expressed in terms of interconversion of potential energy and kinetic energy there is directional information only on one side of the equation; the side with the potential energy function.
When using F=ma with multiple degrees of freedom there is a redundancy: directional information is expressed on both sides of the equation.
Anyway, expressing mechanics taking place in terms of force/acceleration or in terms of potential/kinetic energy is closely related. The work-energy theorem expresses the transformation between the two. While the mathematical form is different the physics content is the same.
Now the other poster has argued that science consists of finding minimum complexity explanations of natural phenomena, and I just argued that the 'minimal complexity' part should be left out. Science is all about making good predictions (and explanations); Occam's razor is more like a guiding principle to help find them (a bit akin to shrinkage in ML) rather than a strict criterion that should be part of the definition. And my example to illustrate this was Newtonian mechanics, which in a complexity/Occam's sense should be superseded by Lagrangian, yet that's not how anyone views this in practice. People view Lagrangian mechanics as a useful calculation tool to make equivalent predictions, but nobody thinks of it as nullifying Newtonian mechanics, even though it should be preferred from Occam's perspective. Or, as you said, the physics content is the same, but the complexity of the description is not, so complexity does not factor into whether it's physics.