S1 (and R1, tbh) has a bad smell to me, or at least points towards an inefficiency. It's incredible that a tiny number of fine-tuning samples and some inserted <wait> tokens can have such a huge effect on model behavior. I bet we'll see a way for the network to learn and "emerge" these capabilities during pre-training. We probably just need to look beyond the plain GPT next-token objective.
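For context, the test-time trick amounts to roughly this at decode time (a minimal sketch, not the s1 authors' code; the model name, the `</think>` end-of-reasoning marker, and the generation settings are assumptions on my part):

```python
# Sketch of "budget forcing": whenever the model tries to end its reasoning
# early, cut off the end-of-thinking marker, append "Wait", and keep decoding.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder; s1 fine-tunes a larger Qwen model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: What is 17 * 24?\nThink step by step.\n<think>"
ids = tok(prompt, return_tensors="pt").input_ids

min_extensions = 2        # how many times we force the model to keep thinking
end_marker = "</think>"   # assumed end-of-reasoning delimiter

for _ in range(min_extensions):
    out = model.generate(ids, max_new_tokens=256, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    text = tok.decode(out[0], skip_special_tokens=True)
    if end_marker in text:
        # Model tried to stop thinking: drop the marker and force it to continue.
        text = text.split(end_marker)[0] + " Wait,"
    ids = tok(text, return_tensors="pt").input_ids

# Final pass: let the model close its reasoning and answer normally.
final = model.generate(ids, max_new_tokens=512, do_sample=False,
                       pad_token_id=tok.eos_token_id)
print(tok.decode(final[0], skip_special_tokens=True))
```

That's the whole intervention: no gradient updates at inference, just a string appended to the context, which is exactly why such a large behavior shift feels like it's exposing slack in the pre-training objective.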
replies(2):