S1: A $6 R1 competitor?

I agree, but LLMs in general have a horrendously bad smell in terms of efficiency. s1 and r1 are just proving it.

The models' latent spaces are insanely large. The vast, vast majority pretty much has to be irrelevant and useless. It's just that training commandeers random fragments of that space to link up the logic it needs, and it's really hard to know which of the weights are useless, which are useful but interchangeable with other weights, and which are truly load-bearing. You could probably find out easily by testing the model against every possible thing you might ever want it to do, just as soon as someone gets around to enumerating that non-enumerable collection of tasks.

These bogus <wait> tokens kind of demonstrate that the models are desperate to escape the limits on the processing they're allowed to do -- they'll take advantage of thinking time even when it's provided in the silliest manner possible. It's amazing what you can live on if it's all you have!

(Apologies for the extended anthropomorphizing.)
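
For anyone who hasn't looked at how the "wait" trick works, here is a rough sketch of the idea as I understand it: when the model tries to end its reasoning, you throw away the stop marker and append "Wait" instead, so it keeps going. Everything named here is made up for illustration (`budget_forced_reasoning`, `toy_model`, the `</think>` marker, the chunk-at-a-time interface), and `toy_model` is a stand-in for a real LLM call, not anything from s1 or r1.

```python
from typing import Callable

END_OF_THINKING = "</think>"  # hypothetical end-of-reasoning marker
WAIT = "Wait"                 # text appended to force more reasoning


def budget_forced_reasoning(
    next_chunk: Callable[[str], str],
    prompt: str,
    min_extensions: int = 2,
    max_chunks: int = 16,
) -> str:
    """Extend reasoning by swapping early stop markers for WAIT.

    next_chunk is an assumed interface: given the text so far, it
    returns the next piece of model output.
    """
    text = prompt
    extensions = 0
    for _ in range(max_chunks):
        chunk = next_chunk(text)
        if chunk == END_OF_THINKING and extensions < min_extensions:
            # Suppress the stop and nudge the model to keep thinking.
            text += WAIT
            extensions += 1
            continue
        text += chunk
        if chunk == END_OF_THINKING:
            break  # the model is finally allowed to stop
    return text


def toy_model(context: str) -> str:
    """Stand-in for an LLM: emits one step, then tries to stop."""
    if context.endswith(" step;"):
        return END_OF_THINKING
    return " step;"


if __name__ == "__main__":
    print(budget_forced_reasoning(toy_model, "Q:", min_extensions=2))
```

With the toy model, each forced "Wait" buys exactly one more step, which is about as silly a way to hand out thinking time as it sounds.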