S1: A $6 R1 competitor?

(timkellogg.me)
851 points tkellogg | 1 comments
jebarker ◴[] No.42948939[source]
S1 (and R1 tbh) has a bad smell to me or at least points towards an inefficiency. It's incredible that a tiny number of samples and some inserted <wait> tokens can have such a huge effect on model behavior. I bet that we'll see a way to have the network learn and "emerge" these capabilities during pre-training. We probably just need to look beyond the GPT objective.
replies(2): >>42949122 #>>42953281 #
1. sfink ◴[] No.42953281[source]
I agree, but LLMs in general have a horrendously bad smell in terms of efficiency. s1 and r1 are just proving it.

The models' latent spaces are insanely large. The vast, vast majority pretty much has to be irrelevant and useless; it's just that the training commandeers random fragments of that space to link up the logic they need, and it's really hard to know which of the weights are useless, which are useful but interchangeable with other weights, and which are truly load-bearing. You could probably find out easily by testing the model against every possible thing you ever might want it to do, just as soon as someone gets around to enumerating that non-enumerable collection of tasks.

These bogus <wait> tokens kind of demonstrate that the models are sort of desperate to escape the limitations imposed by the limited processing they're allowed to do -- they'll take advantage of thinking time even when it's provided in the silliest manner possible. It's amazing what you can live on if it's all you have!

(Apologies for the extended anthropomorphizing.)