S1: A $6 R1 competitor?

(timkellogg.me)

851 points tkellogg | 1 comments | 05 Feb 25 11:05 UTC | HN request time: 0.349s | source

Show context

mtrovo ◴[05 Feb 25 16:48 UTC] No.42951263[source]▶

I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact such an ingeniously simple method can impact performance makes me wonder how many low-hanging fruit we're still missing. So weird to think that improvements on a branch of computer science is boiling down to conjuring the right incantation words, how you even change your mindset to start thinking this way?

replies(16): >>42951704 #>>42951764 #>>42951829 #>>42953577 #>>42954518 #>>42956436 #>>42956535 #>>42956674 #>>42957820 #>>42957909 #>>42958693 #>>42960400 #>>42960464 #>>42961717 #>>42964057 #>>43000399 #

xg15 ◴[05 Feb 25 19:13 UTC] No.42953577[source]▶

>>42951263 #

I think the fact alone that distillation and quantization are techniques that can produce substantial improvements is a strong sign that we still have no real comprehensive understanding how the models work.

If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with less parameters to come close to the performance of the original model.

Yet this is what happens - the distilled or quantized models often come very close to the original model.

So I think there are still many low-hanging fruits to pick.

replies(5): >>42955228 #>>42956999 #>>42957002 #>>42959159 #>>42966394 #

1. cztomsik ◴[06 Feb 25 20:53 UTC] No.42966394[source]▶

>>42953577 #

Nope, it's quite obvious why distillation works. If you just predict next token, then the only information you can use to compute the loss is THE expected token. Whereas if you distill, you can also use (typically few) logits from the teacher.

"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.

Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them too. That way, model learns faster, because it gets more information in each update.

(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")

↑