If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.
Yet this is exactly what happens: distilled or quantized models often come very close to the original model's performance.
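To make the quantization half of that concrete, here's a toy numpy sketch (not any real quantization library, and the sizes are made up) of symmetric int8 rounding of a weight matrix. You throw away three quarters of the bits and the reconstruction error is tiny:

```python
# Toy sketch: symmetric per-tensor int8 quantization of a weight matrix.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)  # stand-in weights

scale = np.abs(W).max() / 127.0             # one scale for the whole tensor
W_q = np.round(W / scale).astype(np.int8)   # 4x smaller than float32
W_hat = W_q.astype(np.float32) * scale      # dequantized approximation

rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4f}")  # on the order of 1%
```

Real schemes (per-channel scales, 4-bit, outlier handling) are more careful than this, but the basic observation is the same: most of the precision in the weights wasn't carrying information.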
So I think there is still plenty of low-hanging fruit to pick.
The question I hear you raising seems to be along the lines of: can we use a new compression method to get better fidelity (reproducibility of the original) at a much smaller size?
That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.
To reiterate: we can lose a lot of data (have incomplete data) and still have a perfectly viewable JPEG (or MP3, same idea).
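The same point in miniature, as a numpy sketch (a low-rank SVD approximation standing in for what a codec like JPEG does, not the actual DCT pipeline): keep ~6% of the components of a smooth "image" and the reconstruction is still close.

```python
# Toy sketch of lossy compression: keep only the top-k SVD components.
import numpy as np

rng = np.random.default_rng(1)
# toy "image": smooth structure plus a little noise
x, y = np.meshgrid(np.linspace(0, 4 * np.pi, 256), np.linspace(0, 4 * np.pi, 256))
img = np.sin(x) * np.cos(y) + 0.05 * rng.normal(size=x.shape)

U, s, Vt = np.linalg.svd(img, full_matrices=False)
k = 16                                       # keep 16 of 256 components
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

rel_err = np.linalg.norm(img - approx) / np.linalg.norm(img)
print(f"kept {k}/256 components, relative error: {rel_err:.3f}")
```

Most of what gets discarded is noise and fine detail that barely matters to the result, which is exactly the bet lossy compression makes, and arguably the bet distillation makes too.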