S1: A $6 R1 competitor?

(timkellogg.me)

851 points tkellogg | 1 comment | 05 Feb 25 11:05 UTC

mtrovo ◴[05 Feb 25 16:48 UTC] No.42951263[source]▶

I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how much low-hanging fruit we're still missing. It's so weird to think that improvements in a branch of computer science are boiling down to conjuring the right incantation. How do you even change your mindset to start thinking this way?
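For anyone who hasn't seen the 'Wait' hack spelled out, here is a minimal sketch. It assumes the trick works by stripping the model's end-of-reasoning marker and appending "Wait" so that generation continues. The `END_OF_THINKING` delimiter, the `step` callable, and `toy_model` are all hypothetical stand-ins, not any real model API.

```python
from typing import Callable

END_OF_THINKING = "</think>"  # hypothetical end-of-reasoning marker


def generate_with_wait(step: Callable[[str], str], prompt: str,
                       extensions: int = 2, nudge: str = "Wait") -> str:
    """Force `extensions` extra rounds of reasoning.

    `step(text)` is assumed to return the model's continuation, ending with
    END_OF_THINKING whenever the model wants to stop reasoning.
    """
    text = prompt
    for _ in range(extensions):
        chunk = step(text)
        # Drop the stop marker so the model cannot finish yet...
        if chunk.endswith(END_OF_THINKING):
            chunk = chunk[: -len(END_OF_THINKING)]
        # ...and append the nudge so it keeps reasoning from here.
        text += chunk + " " + nudge + ","
    # Final round: let the model stop on its own terms.
    return text + step(text)


def toy_model(text: str) -> str:
    """Stand-in for a real model: emits one numbered 'thought', then stops."""
    n = text.count("Wait") + 1
    return f" thought {n}.{END_OF_THINKING}"


if __name__ == "__main__":
    print(generate_with_wait(toy_model, "Q: 2+2?"))
```

The whole "technique" is a string append in a loop, which is kind of the point: all the real work happens inside whatever `step` calls.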

replies(16): >>42951704 #>>42951764 #>>42951829 #>>42953577 #>>42954518 #>>42956436 #>>42956535 #>>42956674 #>>42957820 #>>42957909 #>>42958693 #>>42960400 #>>42960464 #>>42961717 #>>42964057 #>>43000399 #

xg15 ◴[05 Feb 25 19:13 UTC] No.42953577[source]▶

>>42951263 #

I think the mere fact that distillation and quantization can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how the models work.

If we did, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.

Yet this is what happens - the distilled or quantized models often come very close to the original model.
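To make the quantization half of that concrete, here is a toy sketch, assuming simple symmetric int8 quantization with a single scale per weight vector. The function names are made up for illustration, and real schemes are more elaborate. It only measures how far the rounded weights drift from the originals, which is not the same as measuring model quality.

```python
import random


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


def relative_error(n=1024, seed=0):
    """Norm of the rounding error divided by the norm of the weights."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for trained weights
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    err = sum((a - b) ** 2 for a, b in zip(w, w_hat)) ** 0.5
    norm = sum(a * a for a in w) ** 0.5
    return err / norm


if __name__ == "__main__":
    print(f"relative weight error after int8 round trip: {relative_error():.4f}")
```

Each weight goes from a full float down to 8 bits, yet the round-trip error stays small relative to the weights themselves, at least for this Gaussian toy data.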

So I think there is still a lot of low-hanging fruit to pick.

replies(5): >>42955228 #>>42956999 #>>42957002 #>>42959159 #>>42966394 #

teruakohatu ◴[05 Feb 25 21:13 UTC] No.42955228[source]▶

>>42953577 #

> still have no real comprehensive understanding how the models work.

We do understand how they work, we just have not optimised their usage.

For example, take someone who has a good general understanding of how an ICE or EV car works. Even if the user interface is very unfamiliar, they can figure out how to drive any car within a couple of minutes.

But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.

replies(3): >>42955842 #>>42955941 #>>42962716 #

gessha ◴[05 Feb 25 22:04 UTC] No.42955941[source]▶

>>42955228 #

Your example is somewhat inadequate. We _fundamentally_ don’t understand how deep learning systems work, in the sense that they are more or less black boxes that we train and evaluate. Innovations in ML are a whole bunch of wizards with big stacks of money changing “Hmm” to “Wait” and seeing what happens.

Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.

Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.

replies(2): >>42959322 #>>42960342 #

brookst ◴[06 Feb 25 05:14 UTC] No.42959322[source]▶

>>42955941 #

Isn't that just scale? Even small LLMs have more parts than any car.

LLMs are more analogous to economics, psychology, politics -- it is possible there's a core science with explicability, but the systems are so complex that even defining the question is hard.

replies(2): >>42959929 #>>42961952 #

1. ChymeraXYZ ◴[06 Feb 25 07:16 UTC] No.42959929[source]▶

>>42959322 #

Could be, but it does not change the fact that we do not understand them as of now.
