S1: A $6 R1 competitor? (timkellogg.me)
851 points by tkellogg | 9 comments
mtrovo ◴[] No.42951263[source]
I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how much low-hanging fruit we're still missing. It's so weird to think that improvements in a branch of computer science now boil down to conjuring the right incantation words. How do you even change your mindset to start thinking this way?
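
For anyone curious, the mechanism (called "budget forcing" in the s1 paper) is roughly: when the model emits its end-of-thinking delimiter, suppress it and append "Wait" instead, which forces the model to keep reasoning. A minimal sketch of that loop against an HF-style model; the model name and the delimiter below are placeholder assumptions, not the paper's exact setup:

    # Sketch of "budget forcing": intercept the end-of-thinking marker and
    # append "Wait" so the model keeps reasoning. Placeholders throughout.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # placeholder reasoning model
    END_OF_THINK = "</think>"                          # assumed end-of-thinking marker

    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)

    def generate_with_wait(prompt: str, max_waits: int = 2) -> str:
        text = prompt
        for _ in range(max_waits + 1):
            ids = tok(text, return_tensors="pt").input_ids
            out = model.generate(ids, max_new_tokens=1024,
                                 stop_strings=[END_OF_THINK], tokenizer=tok)
            text = tok.decode(out[0], skip_special_tokens=False).strip()
            if not text.endswith(END_OF_THINK):
                break  # model stopped on its own; don't force more thinking
            # Drop the marker and nudge the model to continue its chain of thought.
            text = text[: -len(END_OF_THINK)] + " Wait"
        return text

The surprising part is that simply spending more tokens this way measurably improves accuracy, with no retraining at all.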
replies(16): >>42951704 #>>42951764 #>>42951829 #>>42953577 #>>42954518 #>>42956436 #>>42956535 #>>42956674 #>>42957820 #>>42957909 #>>42958693 #>>42960400 #>>42960464 #>>42961717 #>>42964057 #>>43000399 #
xg15 ◴[] No.42953577[source]
I think the fact alone that distillation and quantization can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how these models work.

If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. It should then be impossible for distilled models with fewer parameters to come close to the performance of the original model.

Yet this is exactly what happens: distilled or quantized models often come very close to the original model's performance.

So I think there is still a lot of low-hanging fruit to pick.
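
The mechanics of distillation itself are simple: train the small model to match the large model's output distribution rather than the hard labels. A minimal sketch of the classic softened-softmax loss (Hinton et al., 2015) in PyTorch; temperature T and mixing weight alpha are the usual knobs:

    # Knowledge distillation loss: the student mimics the teacher's softened
    # output distribution, optionally mixed with the ordinary hard-label loss.
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          T: float = 2.0, alpha: float = 0.5):
        # Soft targets: KL divergence between temperature-softened distributions.
        # The T*T factor keeps gradient magnitudes comparable across temperatures.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: standard cross-entropy against the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

Why this recovers so much of the teacher's quality with far fewer parameters is exactly the open question.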

replies(5): >>42955228 #>>42956999 #>>42957002 #>>42959159 #>>42966394 #
1. ZeljkoS ◴[] No.42957002[source]
We have a partial understanding of why distillation works: it is explained by the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I understand correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network for some subnetworks to land in "winning" states; only then can you distill those winning subnetworks into a smaller network.
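
The standard experiment behind the hypothesis makes this concrete: train the dense network, prune the smallest-magnitude weights, rewind the survivors to their original random initialization, and retrain. A minimal sketch in PyTorch (`train` is a stand-in for an ordinary training loop):

    # One round of iterative magnitude pruning from the Lottery Ticket paper:
    # train dense, prune small weights, rewind survivors to init, retrain.
    import copy
    import torch

    def find_winning_ticket(model, train, prune_frac: float = 0.8):
        init_state = copy.deepcopy(model.state_dict())  # remember the random init
        train(model)                                    # 1. train the dense network

        masks = {}
        for name, p in model.named_parameters():
            if p.dim() < 2:                             # skip biases and norm params
                continue
            k = max(1, int(prune_frac * p.numel()))
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > threshold).float() # 2. keep the largest weights

        with torch.no_grad():                           # 3. rewind survivors to init
            for name, p in model.named_parameters():
                if name in masks:
                    p.copy_(init_state[name] * masks[name])

        train(model)                                    # 4. retrain the sparse ticket
        return model, masks

(A real run would also re-apply the masks after every optimizer step so pruned weights stay at zero; omitted here for brevity.)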

Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."

replies(2): >>42959347 #>>42965862 #
2. 3abiton ◴[] No.42959347[source]
So more 'mature' models might arise in the near future with fewer params and better benchmarks?
replies(3): >>42960280 #>>42960288 #>>42961518 #
3. raducu ◴[] No.42960280[source]
"Better", but not better than the model they were distilled from, at least that's how I understand it.
replies(1): >>42962035 #
4. andreasmetsala ◴[] No.42960288[source]
They might also be more biased and less able to adapt to new technology. Interesting times.
5. coder543 ◴[] No.42961518[source]
That's been happening consistently for over a year now. Small models today are better than big models from a year or two ago.
6. salemba ◴[] No.42962035{3}[source]
I think this is how the "child brain" works too. The better the parents and the environment are, the better the child's development is :)
replies(1): >>43015969 #
7. Arthur_ODC ◴[] No.42965862[source]
So, can a distilled 8B model (say, DeepSeek-R1-Distill-Llama-8B or whatever) be "trained up" to a higher-parameter 16B model after distillation from a superior model, or is it forever stuck at 8B parameters that can only be fine-tuned?
8. cristiancavalli ◴[] No.43015969{4}[source]
Not at all. How many people were geniuses whose parents were not? I can name several, and I'm sure with a quick search you can too.
replies(1): >>43039510 #
9. iFreilicht ◴[] No.43039510{5}[source]
How is that relevant? A few examples don't disprove anything. It's pretty common knowledge that the more successful/rich etc. your parents were, the more likely you are to be successful/rich etc. yourself.

This does not directly prove the theory your parent comment posits, namely that better circumstances during a child's development improve the development of that child's brain. That would require success to be a good predictor of brain development, which I'm somewhat uncertain about.