If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.
Yet this is what happens - the distilled or quantized models often come very close to the original model.
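As a minimal sketch of why quantization gives up so little (not from the thread, just a round-trip of a made-up weight matrix through int8 and back):

```python
# Hypothetical example: symmetric per-row int8 quantization of a random
# "layer" of weights, then dequantization, to see how small the error is.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)  # stand-in for real weights

scale = np.abs(W).max(axis=1, keepdims=True) / 127.0              # one scale per row
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)     # 8-bit representation
W_dq = W_q.astype(np.float32) * scale                              # dequantized approximation

rel_err = np.linalg.norm(W - W_dq) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4%}")             # typically well under 1%
```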
So I think there are still many low-hanging fruits to pick.
We do understand how they work, we just have not optimised their usage.
Take, for example, someone who has a good general understanding of how an ICE or EV car works. Even if the user interface is very unfamiliar, they can figure out how to drive any car within a couple of minutes.
But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.
Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.
Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.
Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."
The question I hear you raising seems to be along the lines of, can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size.
LLMs are more analogous to economics, psychology, politics -- it is possible there's a core science with explicability, but the systems are so complex that even defining the question is hard.
That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.
It's like saying we don't understand how quantum chromodynamics works. Very few people do, and it's the kind of knowledge not easily distilled for the masses in a digestible, popsci way.
Look into how older CNNs work -- we have very good visual/accessible/popsci materials on how they work.
I'm sure we'll have that for LLMs, but it's not worth it to the people who can produce that kind of material to produce it now, when the field is moving so rapidly; those people's time is much better used in improving the LLMs.
The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks.
(discussed here: https://news.ycombinator.com/item?id=34724477 )
> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks
I say this less as an authoritative voice but more as an amused insider: Spend a week with some ML grad students and you will get a chuckle whenever somebody says we’re not some monkeys throwing things at GPUs.
With neural networks big or small, we got no clue what’s going on. You can observe the whole system, from the weights and biases, to the activations, gradients, etc and still get nothing.
On the other hand, one of the reasons why economics, psychology and politics are hard is because we can’t open up people’s heads and define and measure what they’re thinking.
Reiterating again, we can lose a lot of data (have incomplete data) and have a perfectly visible jpeg (or MP3, same thing).
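As a toy illustration of that point (assuming Pillow and a hypothetical local photo.png), you can throw away most of the bytes and still get a perfectly viewable image:

```python
# Re-encode an image as a heavily compressed JPEG and compare file sizes;
# the result discards most of the data yet remains perfectly recognisable.
import os
from PIL import Image

img = Image.open("photo.png").convert("RGB")
img.save("photo_q10.jpg", "JPEG", quality=10)  # aggressive lossy compression

print("original:", os.path.getsize("photo.png"), "bytes")
print("quality=10 jpeg:", os.path.getsize("photo_q10.jpg"), "bytes")
```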
https://youtube.com/shorts/7GrecDNcfMc
Many many layers of that. It’s not a profound mechanism. We can understand how that works, but we’re dumbfounded how such a small mechanism is responsible for all this stuff going on inside a brain.
I don't think it's that we don't understand; it's a level beyond that. We can't fathom the implications: that it could be that simple, just scaled up.
Why? In the spiritual realm, many postulated that even the Elephant you never met is part of your life.
None of this is a coincidence.
The Newtonian model makes provably less accurate predictions than Einsteinian (yes, I'm using a different example), so while still useful in many contexts where accuracy is less important, the number of parameters it requires doesn't much matter when looking for the one true GUT.
My understanding, again as a filthy computationalist, is that an accurate model of the real bonafide underlying architecture of the universe will be the simplest possible way to accurately predict anything. With the word "accurately" doing all the lifting.
As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/Nature%20of%20...
I'm sure there are decreasingly accurate, but still useful, models all the way up the computational complexity hierarchy. Lossy compression is, precisely, using one of them.
Just like alchemists made enormous strides in chemistry, but their goal was to turn piss into gold.
Bad argument. Cavemen understood stone, but they could not build the aqueducts. Medieval people understood iron, water and fire, but they could not make a steam engine.
Finally we understand protons, electrons, and neutrons and the forces that govern them, but it does not mean we understand everything they could possibly make.
How far removed are you from a caveman is the better question. There would be quite some arrogance coming out of you to suggest the several million years gap is anything but an instant in the grand timeline. As in, you understood stone just yesterday ...
The monkey that found the stone is the monkey that built the cathedral. It's only a delusion the second monkey creates to separate it from the first monkey (a feeling of superiority, with the only tangible asset being "a certain amount of notable time passed since point A and point B").
"Finally we understand protons, electrons, and neutrons and the forces that government them but it does not mean we understand everything they could mossibly make"
You and I agree. That those simple things can truly create infinite possibilities. That's all I was saying: we cannot fathom it (either because infinity is hard to fathom, or because its origins are humble - just a few core elements - or both, or something else).
Anyway, this discussion can head in any direction.
"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.
Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them too. That way, the model learns faster, because it gets more information in each update.
(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
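A minimal sketch of that idea (PyTorch, made-up logits and vocabulary sizes, not anyone's actual training code): the student is trained against the teacher's whole softened distribution over names, plus the usual hard label from the dataset.

```python
# Standard soft-target distillation loss: KL divergence to the teacher's
# softened distribution, mixed with cross-entropy on the hard dataset label.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_target, T=2.0, alpha=0.5):
    # Soft part: match the teacher's distribution over all names, not just "Foo".
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard part: still reward the actual next token from the dataset.
    hard = F.cross_entropy(student_logits, hard_target)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 32000)        # batch of 8, vocab of 32k (made up)
teacher_logits = torch.randn(8, 32000)
hard_target = torch.randint(0, 32000, (8,))   # the "Foo" token ids
print(distillation_loss(student_logits, teacher_logits, hard_target))
```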
So it's lossy all the way down with LLMs, too.
Reality > Data created by a human > LLM > Distilled LLM
Not my question to answer; I think that lies in philosophical questions about what a "law" is.
I see useful abstractions all the way down. The linked Asimov essay covers this nicely.
The kind of Occam's Razor-ish rule you seem to be trying to query about is basically a rule of thumb for selecting among formulations of equal observed predictive power that are not strictly equivalent (that is, they predict exactly the same actually observed phenomena, rather than different subsets of subjectively equal importance, but they still differ in predictions which have not been testable). Newtonian and Lagrangian mechanics, by contrast, are different formulations that are strictly equivalent, which means you may choose between them for pedagogy or practical computation, but you can't choose between them for truth, because the truth of one implies the truth of the other, in either direction; they are exactly the same in substance, differing only in presentation.
(And even where it applies, it's just a rule of thumb to reject complications until they are observed to be necessary.)
Here is something that Newtonian mechanics and Lagrangian mechanics have in common: it is necessary to specify whether the context is Minkowski spacetime, or Galilean spacetime.
Before the introduction of relativistic physics the assumption that space is Euclidean was taken for granted by everybody. The transition from Newtonian mechanics to relativistic mechanics was a shift from one metric of spacetime to another.
In retrospect we can recognize Newton's first law as asserting a metric: an object in inertial motion will in equal intervals of time traverse equal distances of space.
We can choose to make the assertion of a metric of spacetime a very wide assertion, such as: position vectors, velocity vectors and acceleration vectors add according to the metric of the spacetime.
Then to formulate Newtonian mechanics these two principles are sufficient: The metric of the spacetime, and Newton's second law.
Hamilton's stationary action is the counterpart of Newton's second law. Just as in the case of Newtonian mechanics: in order to express a theory of motion you have to specify a metric; Galilean metric or Minkowski metric.
To formulate Lagrangian mechanics: choosing stationary action as the foundation is in itself not sufficient; you have to specify a metric.
So: Lagrangian mechanics is not sparser; it is on par with Newtonian mechanics.
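For concreteness, the standard textbook forms side by side (nothing beyond what any mechanics text gives); both presuppose a metric before the kinetic term is even defined:

```latex
% Newton's second law versus the Euler-Lagrange equation obtained from
% Hamilton's stationary action, with L = T - V:
\[
m\,\ddot{\mathbf{r}} = \mathbf{F}
\qquad\text{versus}\qquad
\delta \int_{t_0}^{t_1} L\,dt = 0
\;\;\Longrightarrow\;\;
\frac{d}{dt}\frac{\partial L}{\partial \dot{q}_i} - \frac{\partial L}{\partial q_i} = 0
\]
```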
More generally: transformation between Newtonian mechanics and Lagrangian mechanics is bi-directional.
Shifting between the Newtonian formulation and the Lagrangian formulation is similar to shifting from Cartesian coordinates to polar coordinates. Depending on the nature of the problem one formulation or the other may be more efficient, but it's the same physics.
There's also more than one way to think about complexity. Newtonian mechanics in practice requires introducing forces everywhere, especially for more complex systems, to the point that it can feel a bit ad hoc. Lagrangian mechanics very often requires fewer such introductions and often results in descriptions with fewer equations and fewer terms. If you can explain the same phenomenon with fewer 'entities', then it feels very much like Occam's razor would favor that explanation to me.
In terms of Newtonian mechanics the members of the equivalence class of inertial coordinate systems are related by Galilean transformation.
In terms of relativistic mechanics the members of the equivalence class of inertial coordinate systems are related by Lorentz transformation.
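In one spatial dimension the two transformations (standard textbook forms) are:

```latex
% Galilean boost versus Lorentz boost between inertial coordinate systems:
\[
\text{Galilean:}\quad x' = x - vt, \qquad t' = t
\]
\[
\text{Lorentz:}\quad x' = \gamma\,(x - vt), \qquad
t' = \gamma\left(t - \frac{vx}{c^{2}}\right), \qquad
\gamma = \frac{1}{\sqrt{1 - v^{2}/c^{2}}}
\]
```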
Newton's first law and Newton's third law can be grouped together in a single principle: the Principle of uniformity of Inertia. Inertia is uniform everywhere, in every direction.
That is why I argue that for Newtonian mechanics two principles are sufficient.
The Newtonian formulation is in terms of F=ma; the Lagrangian formulation is in terms of interconversion between potential energy and kinetic energy.
The work-energy theorem expresses the transformation between F=ma and potential/kinetic energy. I derive the work-energy theorem in an answer of mine on physics.stackexchange: https://physics.stackexchange.com/a/788108/17198
The work-energy theorem is the most important theorem of classical mechanics.
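A one-dimensional sketch of that derivation (the linked answer does it in full generality):

```latex
% Integrate F = ma along the path and use (dv/dt) dx = v dv to change variables:
\[
\int_{x_0}^{x_1} F\,dx
  = \int_{x_0}^{x_1} m\,\frac{dv}{dt}\,dx
  = \int_{v_0}^{v_1} m\,v\,dv
  = \tfrac{1}{2} m v_1^{2} - \tfrac{1}{2} m v_0^{2}
\]
```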
About the type of situation where the energy formulation of mechanics is more suitable: when there are multiple degrees of freedom, the force and the acceleration of F=ma are vectorial. So F=ma has the property that there are vector quantities on both sides of the equation.
When expressing things in terms of energy: as we know, the value of kinetic energy is a single value; there is no directional information. In the process of squaring the velocity vector, directional information is discarded; it is lost.
The reason we can afford to lose the directional information of the velocity vector: the description of the potential energy still carries the necessary directional information.
When there are, say, two degrees of freedom the function that describes the potential must be given as a function of two (generalized) coordinates.
This comprehensive function for the potential energy allows us to recover the force vector. To recover the force vector we evaluate the gradient of the potential energy function.
The function that describes the potential is not itself a vector quantity, but it does carry all of the directional information that allows us to recover the force vector.
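With two generalized coordinates, for example, the scalar potential still encodes the direction through its gradient:

```latex
% The force components are recovered from the (scalar) potential energy
% function by taking the negative gradient:
\[
\mathbf{F} = -\nabla V(q_1, q_2)
  = -\left( \frac{\partial V}{\partial q_1},\ \frac{\partial V}{\partial q_2} \right)
\]
```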
I will argue the power of the Lagrangian formulation of mechanics is as follows: when the motion is expressed in terms of interconversion of potential energy and kinetic energy there is directional information only on one side of the equation; the side with the potential energy function.
When using F=ma with multiple degrees of freedom there is a redundancy: directional information is expressed on both sides of the equation.
Anyway, expressing mechanics taking place in terms of force/acceleration or in terms of potential/kinetic energy is closely related. The work-energy theorem expresses the transformation between the two. While the mathematical form is different the physics content is the same.
Now the other poster has argued that science consists of finding minimum complexity explanations of natural phenomena, and I just argued that the 'minimal complexity' part should be left out. Science is all about making good predictions (and explanations); Occam's razor is more like a guiding principle to help find them (a bit akin to shrinkage in ML) rather than a strict criterion that should be part of the definition. And my example to illustrate this was Newtonian mechanics, which in a complexity/Occam's sense should be superseded by Lagrangian, yet that's not how anyone views this in practice. People view Lagrangian mechanics as a useful calculation tool to make equivalent predictions, but nobody thinks of it as nullifying Newtonian mechanics, even though it should be preferred from Occam's perspective. Or, as you said, the physics content is the same, but the complexity of the description is not, so complexity does not factor into whether it's physics.
For example if I ask "If I have two foxes and I take away one, how many foxes do I have?" I reckon attention has been hijacked to essentially highlight the "if I have x and take away y then z" portion of the query to connect to a learned sequence from readily available training data (apparently the whole damn Internet) where there are plenty of examples of said math question trope, just using some other object type than foxes.
I think we could probably prove it by tracing the hyperdimensional space the model exists in and ask it variants of the same question/find hotspots in that space that would indicate it's using those same sequences (with attention branching off to ensure it replies with the correct object type that was referenced).
This does not directly prove the theory your parent comment posits, being that better circumstances during a child's development improve the development of that child's brain. That would require success being a good predictor of brain development, which I'm somewhat uncertain about.