S1 (and R1, tbh) has a bad smell to me, or at least points towards an inefficiency. It's incredible that a tiny number of fine-tuning samples and some inserted <wait> tokens can have such a huge effect on model behavior. I bet we'll see a way for the network to learn and "emerge" these capabilities during pre-training. We probably just need to look beyond the plain GPT next-token objective.
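For context, the test-time trick amounts to roughly this at decode time (a minimal sketch, not the s1 authors' code; the model name, the `</think>` end-of-reasoning marker, and the generation settings are assumptions on my part):

```python
# Sketch of "budget forcing": whenever the model tries to end its reasoning
# early, cut off the end-of-thinking marker, append "Wait", and keep decoding.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder; s1 fine-tunes a larger Qwen model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: What is 17 * 24?\nThink step by step.\n<think>"
ids = tok(prompt, return_tensors="pt").input_ids

min_extensions = 2        # how many times we force the model to keep thinking
end_marker = "</think>"   # assumed end-of-reasoning delimiter

for _ in range(min_extensions):
    out = model.generate(ids, max_new_tokens=256, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    text = tok.decode(out[0], skip_special_tokens=True)
    if end_marker in text:
        # Model tried to stop thinking: drop the marker and force it to continue.
        text = text.split(end_marker)[0] + " Wait,"
    ids = tok(text, return_tensors="pt").input_ids

# Final pass: let the model close its reasoning and answer normally.
final = model.generate(ids, max_new_tokens=512, do_sample=False,
                       pad_token_id=tok.eos_token_id)
print(tok.decode(final[0], skip_special_tokens=True))
```

That's the whole intervention: no gradient updates at inference, just a string appended to the context, which is exactly why such a large behavior shift feels like it's exposing slack in the pre-training objective.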
replies(2):