S1: A $6 R1 competitor?

(timkellogg.me)
851 points tkellogg | 1 comments
mtrovo ◴[] No.42951263[source]
I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how much low-hanging fruit we're still missing. So weird to think that improvements in a branch of computer science are boiling down to conjuring the right incantation words. How do you even change your mindset to start thinking this way?
replies(16): >>42951704 #>>42951764 #>>42951829 #>>42953577 #>>42954518 #>>42956436 #>>42956535 #>>42956674 #>>42957820 #>>42957909 #>>42958693 #>>42960400 #>>42960464 #>>42961717 #>>42964057 #>>43000399 #
xg15 ◴[] No.42953577[source]
I think the fact alone that distillation and quantization are techniques that can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how the models work.

If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.

Yet this is what happens - the distilled or quantized models often come very close to the original model.

So I think there is still a lot of low-hanging fruit to pick.
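
To make the quantization point concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is not how any particular framework does it: the function names `quantize_int8` and `dequantize`, the Gaussian toy weights, and the seed are all assumptions made for illustration. It only shows that the restored weights stay close to the originals; it says nothing about the accuracy of a real quantized model.

```python
import random

def quantize_int8(weights):
    """Symmetric per-tensor quantization (hypothetical helper):
    map each float to an integer in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]

# Toy "weights": an assumption for illustration, not taken from a real model.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer code bounds the error by half a step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f}, max abs error={max_err:.6f}")
```

Each weight is now stored as one of 255 integer codes plus a single shared scale, and the worst-case reconstruction error is at most half of one quantization step.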

replies(5): >>42955228 #>>42956999 #>>42957002 #>>42959159 #>>42966394 #
MR4D ◴[] No.42959159[source]
I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.

The question I hear you raising seems to be along the lines of: can we use a new compression method to get better resolution (reproducibility of the original) at a much smaller size?

replies(4): >>42959654 #>>42963668 #>>42966553 #>>43000430 #
umeshunni ◴[] No.42959654[source]
> in that a distilled model of an LLM is like a JPEG of a photo

That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.

replies(3): >>42960472 #>>42961599 #>>42962196 #
timschmidt ◴[] No.42962196[source]
And what is compression but finding the minimum amount of information required to reproduce a phenomenon? I.e. discovering natural laws.
replies(1): >>42964657 #
t_mann ◴[] No.42964657[source]
Finding minimum complexity explanations isn't what finding natural laws is about, I'd say. It's considered good practice (Occam's razor), but it's often not really clear what the minimal model is, especially when a theory is relatively new. That doesn't prevent it from being a natural law; the key criterion is predictability of natural phenomena, imho. To give an example, one could argue that Lagrangian mechanics requires a smaller set of first principles than Newtonian, but Newton's laws are still very much considered natural laws.
replies(1): >>42965278 #
timschmidt ◴[] No.42965278[source]
Maybe I'm just a filthy computationalist, but the way I see it, the most accurate model of the universe is the one which makes the most accurate predictions with the fewest parameters.

The Newtonian model makes provably less accurate predictions than Einsteinian (yes, I'm using a different example), so while still useful in many contexts where accuracy is less important, the number of parameters it requires doesn't much matter when looking for the one true GUT.

My understanding, again as a filthy computationalist, is that an accurate model of the real bona fide underlying architecture of the universe will be the simplest possible way to accurately predict anything. With the word "accurately" doing all the lifting.

As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/Nature%20of%20...

I'm sure there are decreasingly accurate, but still useful, models all the way up the computational complexity hierarchy. Lossy compression is, precisely, using one of them.

replies(1): >>42966955 #
t_mann ◴[] No.42966955[source]
The thing is, Lagrangian mechanics makes exactly the same predictions as Newtonian, and it starts from a foundation of just one principle (least action) instead of three laws, so it's arguably a sparser theory. It just makes calculations easier, especially for more complex systems, that's its raison d'être. So in a world where we don't know about relativity yet, both make the best predictions we know (and they always agree), but Newton's laws were discovered earlier. Do they suddenly stop being natural laws once Lagrangian mechanics is discovered? Standard physics curricula would not agree with you btw, they practically always teach Newtonian mechanics first and Lagrangian later, also because the latter is mathematically more involved.
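
For reference, the standard textbook case behind "exactly the same predictions" can be written out as a sketch, assuming a single particle of mass m moving in one dimension in a conservative potential V(x) (these simplifying assumptions are added here, not part of the comment above). Requiring the action to be stationary gives the Euler-Lagrange equation, which for this Lagrangian is Newton's second law:

```latex
\begin{align}
L(x,\dot{x}) &= \tfrac{1}{2} m \dot{x}^2 - V(x), \qquad S[x] = \int_{t_0}^{t_1} L\,dt \\
\delta S = 0 \;&\Longrightarrow\; \frac{d}{dt}\frac{\partial L}{\partial \dot{x}} - \frac{\partial L}{\partial x} = 0 \\
&\Longrightarrow\; m\ddot{x} = -\frac{dV}{dx} = F(x)
\end{align}
```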
replies(3): >>42967070 #>>42967186 #>>42986201 #
dragonwriter ◴[] No.42967186[source]
Laws (in science, not government) are just relationships that are consistently observed, so Newton's laws remained laws until contradictions were observed, regardless of the existence of one or more alternative models which would predict them to hold.

The kind of Occam’s Razor-ish rule you seem to be asking about is basically a rule of thumb for selecting among formulations of equal observed predictive power that are not strictly equivalent (that is, even if they predict exactly the same actually observed phenomena instead of different subsets of subjectively equal importance, they still differ in predictions which have not been testable). Newtonian and Lagrangian mechanics, by contrast, are different formulations that are strictly equivalent, which means you may choose between them for pedagogy or practical computation, but you can't choose between them for truth because the truth of one implies the truth of the other, in either direction; they are exactly the same in substance, differing only in presentation.

(And even where it applies, it's just a rule of thumb to reject complications until they are observed to be necessary.)

replies(1): >>42979920 #
t_mann ◴[] No.42979920[source]
Newtonian and Lagrangian mechanics are equivalent only in their predictions, not in their complexity - one requires three assumptions, the other just one. Now you say the fact that they have the same predictions makes them equivalent, and I agree. But it's clearly not compatible with what the other poster said about looking for the simplest possible way to explain a phenomenon. If you believe that that's how science should work, you'd need to discard theories as soon as simpler ones that make the same predictions are found (as in the case of Newtonian mechanics). It's a valid philosophical standpoint imho, but it's in opposition to how scientists generally approach Occam's razor, as evidenced e.g. by common physics curricula. That's what I was pointing out. Having to exclude Newtonian mechanics from what can be considered science is just one prominent consequence of the other poster's philosophical stance, one that could warrant reconsidering whether that's how you want to define it.