S1: A $6 R1 competitor?

(timkellogg.me)
851 points | tkellogg | 51 comments
mtrovo ◴[] No.42951263[source]
I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how much low-hanging fruit we're still missing. So weird to think that improvements in a branch of computer science are boiling down to conjuring the right incantation words. How do you even change your mindset to start thinking this way?
replies(16): >>42951704 #>>42951764 #>>42951829 #>>42953577 #>>42954518 #>>42956436 #>>42956535 #>>42956674 #>>42957820 #>>42957909 #>>42958693 #>>42960400 #>>42960464 #>>42961717 #>>42964057 #>>43000399 #
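The 'Wait' trick mentioned above can be sketched roughly as follows. This is a toy illustration, not the paper's code: `generate_step`, `toy_model`, and the `</think>` delimiter are hypothetical stand-ins for a real model call and its end-of-thinking token. The idea is that when the model tries to stop reasoning too early, the stop marker is dropped and "Wait" is appended so that it keeps going.

```python
from typing import Callable

END_OF_THINKING = "</think>"  # assumed delimiter; real models use their own token


def budget_forced_thinking(
    generate_step: Callable[[str], str],
    prompt: str,
    min_extensions: int = 2,
    max_chunks: int = 8,
) -> str:
    """Extend a reasoning trace by replacing early stops with "Wait".

    generate_step is a hypothetical callable: given the text so far, it
    returns the next chunk of reasoning, which ends with END_OF_THINKING
    when the model wants to stop.
    """
    trace = prompt
    extensions = 0
    for _ in range(max_chunks):
        chunk = generate_step(trace)
        if chunk.endswith(END_OF_THINKING):
            if extensions < min_extensions:
                # Drop the stop marker and nudge the model to continue.
                trace += chunk[: -len(END_OF_THINKING)] + " Wait,"
                extensions += 1
                continue
            # Enough extensions already: accept the model's stop.
            return trace + chunk
        trace += chunk
    # Chunk budget exhausted: force the end of the thinking phase.
    return trace + END_OF_THINKING


def toy_model(text_so_far: str) -> str:
    """Stand-in for an LLM that always tries to stop after one short thought."""
    return " 2+2=4." + END_OF_THINKING


result = budget_forced_thinking(toy_model, "Q: 2+2?", min_extensions=2)
print(result)
```

With a real model, the extra "Wait," would change what gets generated next; the toy model here only shows the control flow.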
1. xg15 ◴[] No.42953577[source]
I think the fact alone that distillation and quantization are techniques that can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how the models work.

If we did, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.

Yet this is what happens - the distilled or quantized models often come very close to the original model.

So I think there are still many low-hanging fruits to pick.

replies(5): >>42955228 #>>42956999 #>>42957002 #>>42959159 #>>42966394 #
2. teruakohatu ◴[] No.42955228[source]
> still have no real comprehensive understanding how the models work.

We do understand how they work, we just have not optimised their usage.

For example, take someone who has a good general understanding of how an ICE or EV car works. Even if the user interface is very unfamiliar, they can figure out how to drive any car within a couple of minutes.

But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.

replies(3): >>42955842 #>>42955941 #>>42962716 #
3. spiorf ◴[] No.42955842[source]
We know how the next token is selected, but not why doing that repeatedly brings all the capabilities it does. We really don't understand how the emergent behaviours emerge.
replies(2): >>42958701 #>>43000550 #
4. gessha ◴[] No.42955941[source]
Your example is somewhat inadequate. We _fundamentally_ don’t understand how deep learning systems work in the sense that they are more or less black boxes that we train and evaluate. Innovations in ML are a whole bunch of wizards with big stacks of money changing “Hmm” to “Wait” and seeing what happens.

Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.

Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.

replies(2): >>42959322 #>>42960342 #
5. pertymcpert ◴[] No.42956999[source]
For quantization I don't think that's really true. Quantization is just making more efficient use of bits in memory to represent numbers.
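As a rough illustration of "fewer bits per number", here is a minimal sketch of symmetric 8-bit quantization in plain Python. The weight values are made up, and real schemes (per-channel scales, 4-bit formats, outlier handling) are more involved; this only shows the round trip and the size of the rounding error.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.82, -1.27, 0.003, 0.41, -0.66]  # made-up example values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code means each error is at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight now needs one byte plus a shared scale instead of four bytes, at the cost of an error bounded by half the step size.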
6. ZeljkoS ◴[] No.42957002[source]
We have a partial understanding of why distillation works—it is explained by The Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I am understanding correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network, for some neurons to have "winning" states. Then you can distill those winning subsystems to a smaller network.

Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."

replies(2): >>42959347 #>>42965862 #
7. Valgrim ◴[] No.42958701{3}[source]
It feels less like a word prediction algorithm and more like a world model compression algorithm. Maybe we tried to create one and accidentally created the other?
replies(2): >>42960470 #>>42962374 #
8. MR4D ◴[] No.42959159[source]
I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.

The question I hear you raising seems to be along the lines of, can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size.

replies(4): >>42959654 #>>42963668 #>>42966553 #>>43000430 #
9. brookst ◴[] No.42959322{3}[source]
Isn't that just scale? Even small LLMs have more parts than any car.

LLMs are more analogous to economics, psychology, politics -- it is possible there's a core science with explicability, but the systems are so complex that even defining the question is hard.

replies(2): >>42959929 #>>42961952 #
10. 3abiton ◴[] No.42959347[source]
So more 'mature' models might arise in the near future with less params and better benchmarks?
replies(3): >>42960280 #>>42960288 #>>42961518 #
11. umeshunni ◴[] No.42959654[source]
> in that a distilled model of an LLM is like a JPEG of a photo

That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.

replies(3): >>42960472 #>>42961599 #>>42962196 #
12. ChymeraXYZ ◴[] No.42959929{4}[source]
Could be, but it does not change the fact that we do not understand them as of now.
13. raducu ◴[] No.42960280{3}[source]
"Better", but not better than the model they were distilled from, at least that's how I understand it.
replies(1): >>42962035 #
14. andreasmetsala ◴[] No.42960288{3}[source]
They might also be more biased and less able to adapt to new technology. Interesting times.
15. raducu ◴[] No.42960342{3}[source]
> _fundamentally_ don’t understand how deep learning systems work.

It's like saying we don't understand how quantum chromodynamics works. Very few people do, and it's the kind of knowledge that is not easily distilled for the masses in a digestible, popsci way.

Look into how older CNNs work -- we have very good visual/accessible/popsci materials on how they work.

I'm sure we'll have that for LLMs, but it's not worth it to the people who can produce that kind of material to produce it now, when the field is moving so rapidly; those people's time is much better spent improving the LLMs.

The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks.

replies(2): >>42961916 #>>42965302 #
16. codeulike ◴[] No.42960470{4}[source]
It's almost like a Model of Language, but very Large
17. kedarkhand ◴[] No.42960472{3}[source]
Well, a JPEG can be thought of as a compression of the part of the natural world that was photographed
replies(1): >>42962058 #
18. coder543 ◴[] No.42961518{3}[source]
That's been happening consistently for over a year now. Small models today are better than big models from a year or two ago.
19. homarp ◴[] No.42961599{3}[source]
hence https://www.newyorker.com/tech/annals-of-technology/chatgpt-... (by Ted Chiang)

(discussed here: https://news.ycombinator.com/item?id=34724477 )

20. gessha ◴[] No.42961916{4}[source]
As a person who has trained a number of computer vision deep networks, I can tell you that we have some cool-looking visualizations on how lower layers work but no idea how later layers work. The intuition is built over training numerous networks and trying different hyperparameters, data shuffling, activations, etc. It’s absolutely brutal over here. If the theory was there, people like Karpathy who have great teacher vibes would’ve explained it for the mortal grad students or enthusiast tinkerers.

> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks

I say this less as an authoritative voice but more as an amused insider: Spend a week with some ML grad students and you will get a chuckle whenever somebody says we’re not some monkeys throwing things at GPUs.

replies(1): >>42962093 #
21. gessha ◴[] No.42961952{4}[source]
You can make a bigger ICE engine (like a container ship engine) and still understand how the whole thing works. Maybe there’s more parts moving but it still has the structure of an ICE engine.

With neural networks big or small, we got no clue what’s going on. You can observe the whole system, from the weights and biases, to the activations, gradients, etc and still get nothing.

On the other hand, one of the reasons why economics, psychology and politics are hard is because we can’t open up people’s heads and define and measure what they’re thinking.

replies(1): >>42962060 #
22. salemba ◴[] No.42962035{4}[source]
I think this is how the "child brain" works too. The better the parents and the environment are, the better the child's evolution is :)
replies(1): >>43015969 #
23. bloomingkales ◴[] No.42962058{4}[source]
And we can answer the question why quantization works with a lossy format, since quantization just drops accuracy for space but still gives us a good enough output, just like a lossy jpeg.

Reiterating again, we can lose a lot of data (have incomplete data) and have a perfectly visible jpeg (or MP3, same thing).

24. ijk ◴[] No.42962060{5}[source]
One way I've heard it summarized: Computer Science as a field is used to things being like physics or chemistry, but we've suddenly encountered something that behaves more like biology.
replies(1): >>42962185 #
25. bloomingkales ◴[] No.42962093{5}[source]
It may be as simple as this:

https://youtube.com/shorts/7GrecDNcfMc

Many many layers of that. It’s not a profound mechanism. We can understand how that works, but we’re dumbfounded how such a small mechanism is responsible for all this stuff going on inside a brain.

I don’t think we don’t understand, it’s a level beyond that. We can’t fathom the implications, that it could be that simple, just scaled up.

replies(1): >>42965342 #
26. timschmidt ◴[] No.42962196{3}[source]
And what is compression but finding the minimum amount of information required to reproduce a phenomenon? I.e. discovering natural laws.
replies(1): >>42964657 #
27. bloomingkales ◴[] No.42962374{4}[source]
Why would asking a question about ice cream trigger a consideration about all possible topics? As in, to formulate the answer, the LLM will consider the origin of Elephants even. It won’t be significant, but it will be factored in.

Why? In the spiritual realm, many postulated that even the Elephant you never met is part of your life.

None of this is a coincidence.

28. adamc ◴[] No.42962716[source]
The "Wait" vs. "Hmm" discussion in the paper does not suggest we know how they work. If we knew, we wouldn't have to try things and measure to figure out the best prompt.
29. red1reaper ◴[] No.42963026{7}[source]
"God" as a concept is unproven to exist, and it is also impossible to prove, so for all intents and purposes it doesn't exist.
replies(1): >>42963535 #
30. ◴[] No.42963535{8}[source]
31. ziofill ◴[] No.42963668[source]
What you say makes sense, but is there the possibility that because it’s compressed it can generalize more? In the spirit of bias/variance.
32. t_mann ◴[] No.42964657{4}[source]
Finding minimum complexity explanations isn't what finding natural laws is about, I'd say. It's considered good practice (Occam's razor), but it's often not really clear what the minimal model is, especially when a theory is relatively new. That doesn't prevent it from being a natural law, the key criterion is predictability of natural phenomena, imho. To give an example, one could argue that Lagrangian mechanics requires a smaller set of first principles than Newtonian, but Newton's laws are still very much considered natural laws.
replies(1): >>42965278 #
33. timschmidt ◴[] No.42965278{5}[source]
Maybe I'm just a filthy computationalist, but the way I see it, the most accurate model of the universe is the one which makes the most accurate predictions with the fewest parameters.

The Newtonian model makes provably less accurate predictions than Einsteinian (yes, I'm using a different example), so while still useful in many contexts where accuracy is less important, the number of parameters it requires doesn't much matter when looking for the one true GUT.

My understanding, again as a filthy computationalist, is that an accurate model of the real bonafide underlying architecture of the universe will be the simplest possible way to accurately predict anything. With the word "accurately" doing all the lifting.

As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/Nature%20of%20...

I'm sure there are decreasingly accurate, but still useful, models all the way up the computational complexity hierarchy. Lossy compression is, precisely, using one of them.

replies(1): >>42966955 #
34. ClumsyPilot ◴[] No.42965302{4}[source]
> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work

Just like alchemists made enormous strides in chemistry, but their goal was to turn piss into gold.

35. ClumsyPilot ◴[] No.42965342{6}[source]
> Many many layers of that. It’s not a profound mechanism

Bad argument. Cavemen understood stone, but they could not build the aqueducts. Medieval people understood iron, water and fire but they could not make a steam engine

Finally we understand protons, electrons, and neutrons and the forces that govern them, but it does not mean we understand everything they could possibly make

replies(1): >>42965612 #
36. bloomingkales ◴[] No.42965612{7}[source]
"Cavemen understood stone"

How far removed are you from a caveman is the better question. There would be quite some arrogance coming out of you to suggest the several million years gap is anything but an instant in the grand timeline. As in, you understood stone just yesterday ...

The monkey that found the stone is the monkey that built the cathedral. It's only a delusion the second monkey creates to separate it from the first monkey (a feeling of superiority, with the only tangible asset being "a certain amount of notable time passed since point A and point B").

"Finally we understand protons, electrons, and neutrons and the forces that govern them, but it does not mean we understand everything they could possibly make"

You and I agree. That those simple things can truly create infinite possibilities. That's all I was saying, we cannot fathom it (either because infinity is hard to fathom, or that its origins are humble - just a few core elements, or both, or something else).

Anyway, this discussion can head in any direction.

37. Arthur_ODC ◴[] No.42965862[source]
So, can a distilled 8B model (say, DeepSeek-R1-Distill-Llama-8B or whatever) be "trained up" to a larger 16B-parameter model after distillation from a superior model, or is it forever stuck at 8B parameters that can only be fine-tuned?
38. cztomsik ◴[] No.42966394[source]
Nope, it's quite obvious why distillation works. If you just predict next token, then the only information you can use to compute the loss is THE expected token. Whereas if you distill, you can also use (typically few) logits from the teacher.

"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.

Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them too. That way, model learns faster, because it gets more information in each update.

(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
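A small sketch of the difference described above, using made-up numbers: the vocabulary, the teacher logits, the temperature, and the 50/50 mixing weight `alpha` are all assumptions for illustration. The hard target carries a signal for exactly one token, while the softened teacher distribution also spreads weight over the other plausible names.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student's softmax."""
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(s) for t, s in zip(target_probs, student_probs) if t > 0
    )


vocab = ["Foo", "Bar", "Alice", "table"]  # toy vocabulary
hard_target = [1.0, 0.0, 0.0, 0.0]        # dataset says the next token is "Foo"
teacher_logits = [3.0, 2.5, 2.0, -4.0]    # made-up teacher output
soft_target = softmax(teacher_logits, temperature=2.0)

student_logits = [0.0, 0.0, 0.0, 0.0]     # untrained student: uniform guess
hard_loss = cross_entropy(hard_target, student_logits)
soft_loss = cross_entropy(soft_target, student_logits)

alpha = 0.5  # assumed mixing weight between the two objectives
combined_loss = alpha * hard_loss + (1 - alpha) * soft_loss
print(dict(zip(vocab, soft_target)), combined_loss)
```

This is a simplified version: practical setups often apply the temperature to the student as well and keep only the teacher's top few logits, as the comment notes.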

39. cmgriffing ◴[] No.42966553[source]
This brings up an interesting thought too. A photo is just a lossy representation of the real world.

So it's lossy all the way down with LLMs, too.

Reality > Data created by a human > LLM > Distilled LLM

40. t_mann ◴[] No.42966955{6}[source]
The thing is, Lagrangian mechanics makes exactly the same predictions as Newtonian, and it starts from a foundation of just one principle (least action) instead of three laws, so it's arguably a sparser theory. It just makes calculations easier, especially for more complex systems, that's its raison d'être. So in a world where we don't know about relativity yet, both make the best predictions we know (and they always agree), but Newton's laws were discovered earlier. Do they suddenly stop being natural laws once Lagrangian mechanics is discovered? Standard physics curricula would not agree with you btw, they practically always teach Newtonian mechanics first and Lagrangian later, also because the latter is mathematically more involved.
replies(3): >>42967070 #>>42967186 #>>42986201 #
41. timschmidt ◴[] No.42967070{7}[source]
> Do they suddenly stop being natural laws once Lagrangian mechanics is discovered?

Not my question to answer, I think that lies in philosophical questions about what is a "law".

I see useful abstractions all the way down. The linked Asimov essay covers this nicely.

42. dragonwriter ◴[] No.42967186{7}[source]
Laws (in science, not government) are just a relationship that is consistently observed, so Newton's laws remain laws until contradictions were observed, regardless of the existence of one or more alternative models which would predict them to hold.

The kind of Occam’s Razor-ish rule you seem to be trying to query about is basically a rule of thumb for selecting among formulations of equal observed predictive power that are not strictly equivalent (that is, if they predict exactly the same actually observed phenomenon instead of different subsets of subjectively equal importance, they still differ in predictions which have not been testable), whereas Newtonian and Lagrangian mechanics are different formulations that are strictly equivalent, which means you may choose between them for pedagogy or practical computation, but you can't choose between them for truth because the truth of one implies the truth of the other, in either direction; they are exactly the same in substance, differing only in presentation.

(And even where it applies, it's just a rule of thumb to reject complications until they are observed to be necessary.)

replies(1): >>42979920 #
43. t_mann ◴[] No.42979920{8}[source]
Newtonian and Lagrangian mechanics are equivalent only in their predictions, not in their complexity - one requires three assumptions, the other just one. Now you say the fact that they have the same predictions makes them equivalent, and I agree. But it's clearly not compatible with what the other poster said about looking for the simplest possible way to explain a phenomenon. If you believe that that's how science should work, you'd need to discard theories as soon as simpler ones that make the same predictions are found (as in the case of Newtonian mechanics). It's a valid philosophical standpoint imho, but it's in opposition to how scientists generally approach Occam's razor, as evidenced eg by common physics curricula. That's what I was pointing out. Having to exclude Newtonian mechanics from what can be considered science is just one prominent consequence of the other poster's philosophical stance, one that could warrant reconsidering whether that's how you want to define it.
44. Cleonis ◴[] No.42986201{7}[source]
I will argue that 'has least action as foundation' does not in itself imply that Lagrangian mechanics is a sparser theory:

Here is something that Newtonian mechanics and Lagrangian mechanics have in common: it is necessary to specify whether the context is Minkowski spacetime, or Galilean spacetime.

Before the introduction of relativistic physics the assumption that space is euclidean was granted by everybody. The transition from Newtonian mechanics to relativistic mechanics was a shift from one metric of spacetime to another.

In retrospect we can recognize Newton's first law as asserting a metric: an object in inertial motion will in equal intervals of time traverse equal distances of space.

We can choose to make the assertion of a metric of spacetime a very wide assertion: such as: position vectors, velocity vectors and acceleration vectors add according to the metric of the spacetime.

Then to formulate Newtonian mechanics these two principles are sufficient: The metric of the spacetime, and Newton's second law.

Hamilton's stationary action is the counterpart of Newton's second law. Just as in the case of Newtonian mechanics: in order to express a theory of motion you have to specify a metric; Galilean metric or Minkowski metric.

To formulate Lagrangian mechanics: choosing stationary action as foundation is in itself not sufficient; you have to specify a metric.

So: Lagrangian mechanics is not sparser; it is on par with Newtonian mechanics.

More generally: transformation between Newtonian mechanics and Lagrangian mechanics is bi-directional.

Shifting between Newtonian formulation and Lagrangian formulation is similar to shifting from cartesian coordinates to polar coordinates. Depending on the nature of the problem one formulation or the other may be more efficient, but it's the same physics.

replies(1): >>42987402 #
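For the simplest case, the equivalence discussed above can be checked in a few lines. The sketch below assumes a single particle of mass m moving in one dimension in a potential V(x), which is the standard textbook setup rather than anything specific to this thread. Applying the Euler-Lagrange equation to the Lagrangian gives back Newton's second law.

```latex
\begin{align}
L(x,\dot{x}) &= \tfrac{1}{2} m \dot{x}^{2} - V(x) \\
\frac{d}{dt}\frac{\partial L}{\partial \dot{x}} - \frac{\partial L}{\partial x} &= 0 \\
m\ddot{x} + \frac{dV}{dx} &= 0 \\
m\ddot{x} &= -\frac{dV}{dx} = F
\end{align}
```

Read in the other direction, starting from F = ma with a force derived from a potential, the same steps recover the Euler-Lagrange equation.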
45. t_mann ◴[] No.42987402{8}[source]
You seem to know more about this than me, but it seems to me that the first law does more than just induce a metric, I've always thought of it as positing inertia as an axiom.

There's also more than one way to think about complexity. Newtonian mechanics in practice requires introducing forces everywhere, especially for more complex systems, to the point that it can feel a bit ad hoc. Lagrangian mechanics very often requires fewer such introductions and often results in descriptions with fewer equations and fewer terms. If you can explain the same phenomenon with fewer 'entities', then it feels very much like Occam's razor would favor that explanation to me.

replies(1): >>42993136 #
46. Cleonis ◴[] No.42993136{9}[source]
Indeed inertia. Theory of motion consists of describing the properties of Inertia.

In terms of Newtonian mechanics the members of the equivalence class of inertial coordinate systems are related by Galilean transformation.

In terms of relativistic mechanics the members of the equivalence class of inertial coordinate systems are related by Lorentz transformation.

Newton's first law and Newton's third law can be grouped together in a single principle: the Principle of uniformity of Inertia. Inertia is uniform everywhere, in every direction.

That is why I argue that for Newtonian mechanics two principles are sufficient.

The Newtonian formulation is in terms of F=ma, the Lagrangian formulation is in terms of interconversion between potential energy and kinetic energy

The work-energy theorem expresses the transformation between F=ma and potential/kinetic energy. Here is a link to an answer of mine on physics.stackexchange where I derive the work-energy theorem: https://physics.stackexchange.com/a/788108/17198

The work-energy theorem is the most important theorem of classical mechanics.

About the type of situation where the Energy formulation of mechanics is more suitable: When there are multiple degrees of freedom then the force and the acceleration of F=ma are vectorial. So F=ma has the property that there are vector quantities on both sides of the equation.

When expressing in terms of energy: As we know: the value of kinetic energy is a single value; there is no directional information. In the process of squaring the velocity vector directional information is discarded, it is lost.

The reason we can afford to lose the directional information of the velocity vector: the description of the potential energy still carries the necessary directional information.

When there are, say, two degrees of freedom the function that describes the potential must be given as a function of two (generalized) coordinates.

This comprehensive function for the potential energy allows us to recover the force vector. To recover the force vector we evaluate the gradient of the potential energy function.

The function that describes the potential is not itself a vector quantity, but it does carry all of the directional information that allows us to recover the force vector.

I will argue the power of the Lagrangian formulation of mechanics is as follows: when the motion is expressed in terms of interconversion of potential energy and kinetic energy there is directional information only on one side of the equation; the side with the potential energy function.

When using F=ma with multiple degrees of freedom there is a redundancy: directional information is expressed on both sides of the equation.

Anyway, expressing mechanics taking place in terms of force/acceleration or in terms of potential/kinetic energy is closely related. The work-energy theorem expresses the transformation between the two. While the mathematical form is different the physics content is the same.

replies(1): >>42997618 #
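The point that a scalar potential still carries the directional information can be illustrated numerically. The sketch below uses a hypothetical two-dimensional spring potential (the form of `potential` and the constant `k` are assumptions for the example) and recovers the force vector as the negative gradient, estimated with central differences.

```python
def potential(x, y, k=2.0):
    """Hypothetical 2D spring potential V = 0.5 * k * (x^2 + y^2)."""
    return 0.5 * k * (x * x + y * y)


def force_from_potential(V, x, y, h=1e-6):
    """Estimate F = -grad V at (x, y) using central differences."""
    fx = -(V(x + h, y) - V(x - h, y)) / (2 * h)
    fy = -(V(x, y + h) - V(x, y - h)) / (2 * h)
    return fx, fy


# The potential is a single number at each point, yet its gradient
# yields both components of the force vector.
fx, fy = force_from_potential(potential, 1.0, -0.5)
print(fx, fy)
```

For this potential the analytic answer is F = (-k x, -k y), so the numerical result can be compared against it directly.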
47. t_mann ◴[] No.42997618{10}[source]
Nicely said, but I think then we are in agreement that Newtonian mechanics has a bit of redundancy that can be removed by switching to a Lagrangian framework, no? I think that's a situation where Occam's razor can be applied very cleanly: if we can make the exact same predictions with a sparser model.

Now the other poster has argued that science consists of finding minimum complexity explanations of natural phenomena, and I just argued that the 'minimal complexity' part should be left out. Science is all about making good predictions (and explanations), Occam's razor is more like a guiding principle to help find them (a bit akin to shrinkage in ML) rather than a strict criterion that should be part of the definition. And my example to illustrate this was Newtonian mechanics, which in a complexity/Occam's sense should be superseded by Lagrangian, yet that's not how anyone views this in practice. People view Lagrangian mechanics as a useful calculation tool to make equivalent predictions, but nobody thinks of it as nullifying Newtonian mechanics, even though it should be preferred from Occam's perspective. Or, as you said, the physics content is the same, but the complexity of the description is not, so complexity does not factor into whether it's physics.

48. fennecfoxy ◴[] No.43000430[source]
Yeah, but it does seem that they're getting high % numbers for the distilled models' accuracy against the larger model. If the smaller model is 90% as accurate as the larger one but uses far fewer than 90% of the parameters, then surely that counts as a win.
49. fennecfoxy ◴[] No.43000550{3}[source]
Eh, I feel like that's mostly just down to this: yes, transformers are a "next token predictor", but during instruct fine-tuning the attention-related wagon slapped on the back is partially hijacked as a bridge from input tokens to sequences of connections in the weights.

For example if I ask "If I have two foxes and I take away one, how many foxes do I have?" I reckon attention has been hijacked to essentially highlight the "if I have x and take away y then z" portion of the query to connect to a learned sequence from readily available training data (apparently the whole damn Internet) where there are plenty of examples of said math question trope, just using some other object type than foxes.

I think we could probably prove it by tracing the hyperdimensional space the model exists in and ask it variants of the same question/find hotspots in that space that would indicate it's using those same sequences (with attention branching off to ensure it replies with the correct object type that was referenced).

50. cristiancavalli ◴[] No.43015969{5}[source]
Not at all — how many people were geniuses and their parents not? I can name several and I’m sure with a quick search you can too.
replies(1): >>43039510 #
51. iFreilicht ◴[] No.43039510{6}[source]
How is that relevant? A few examples do not disprove anything. It's pretty common knowledge that the more successful/rich etc. your parents were, the more likely you'll be successful/rich etc.

This does not directly prove the theory your parent comment posits, being that better circumstances during a child's development improve the development of that child's brain. That would require success being a good predictor of brain development, which I'm somewhat uncertain about.