S1: A $6 R1 competitor?

(timkellogg.me)
851 points tkellogg | 6 comments
jebarker ◴[] No.42948939[source]
S1 (and R1 tbh) has a bad smell to me or at least points towards an inefficiency. It's incredible that a tiny number of samples and some inserted <wait> tokens can have such a huge effect on model behavior. I bet that we'll see a way to have the network learn and "emerge" these capabilities during pre-training. We probably just need to look beyond the GPT objective.
replies(2): >>42949122 #>>42953281 #
1. pas ◴[] No.42949122[source]
can you please elaborate on the wait tokens? what's that? how do they work? is that also from the R1 paper?
replies(2): >>42949165 #>>42956305 #
2. jebarker ◴[] No.42949165[source]
The same idea is in both the R1 and S1 papers (<think> tokens are used similarly). Basically they're using special tokens to mark in the prompt where the LLM should think more/revise the previous response. This can be repeated many times until some stop criterion is met. S1 manually inserts these with heuristics, R1 learns the placement through RL I think.
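As a rough sketch of the "budget forcing" trick described above (assuming the mechanism from the S1 paper: when the model tries to close its reasoning, suppress the end-of-thinking marker and append "Wait" so it keeps going; `generate()` here is a stub standing in for a real LLM decoding call):

```python
# Toy sketch of S1-style budget forcing. A real implementation would call
# an actual model; generate() is a hypothetical stub for illustration.

def generate(prompt: str) -> str:
    # Stub: a real version would decode from the model until it emits
    # the end-of-thinking marker.
    return "...some reasoning...</think>"

def budget_forced_think(prompt: str, max_extensions: int = 2) -> str:
    trace = prompt
    for _ in range(max_extensions):
        chunk = generate(trace)
        if "</think>" in chunk:
            # Strip the end-of-thinking marker and nudge the model to
            # continue reasoning by appending "Wait".
            chunk = chunk.split("</think>")[0] + "\nWait"
        trace += chunk
    # Final pass: let the model close its reasoning naturally.
    trace += generate(trace)
    return trace
```

The point is just that the "wait" insertion is a decoding-time intervention, not a change to the model weights.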
replies(1): >>42949855 #
3. whimsicalism ◴[] No.42949855[source]
? They're not really special tokens
replies(1): >>42952423 #
4. jebarker ◴[] No.42952423{3}[source]
i'm not actually sure whether they're special tokens in the sense of being in the vocabulary
replies(1): >>42952547 #
5. whimsicalism ◴[] No.42952547{4}[source]
<think> might be, but I think "wait" is tokenized like any other word in pretraining
6. throwaway314155 ◴[] No.42956305[source]
There's a decent explanation in the article, just FYI.