
345 points kashifr | 2 comments
1. BarakWidawsky No.44502634
It’s interesting that it looks like they didn’t apply their own RL to the model, and instead fine-tuned on reasoning traces from existing large datasets and on reasoning traces generated by larger models
replies(1): >>44502761
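
The approach described above (supervised fine-tuning on distilled reasoning traces rather than RL) could look roughly like the following sketch using TRL's SFTTrainer. This is only an illustration: the dataset name and model checkpoint are hypothetical placeholders, not the actual training recipe.

    # Minimal sketch of SFT on distilled reasoning traces (not the actual recipe).
    # The dataset and checkpoint names below are placeholders.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Reasoning traces (prompt + chain-of-thought + final answer), either taken from
    # existing datasets or generated by a larger "teacher" model.
    traces = load_dataset("your-org/reasoning-traces", split="train")  # placeholder

    trainer = SFTTrainer(
        model="your-org/small-base-model",  # placeholder small base checkpoint
        train_dataset=traces,
        args=SFTConfig(output_dir="small-reasoner-sft"),
    )
    trainer.train()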
2. lewtun No.44502761
Indeed, we opted for offline methods like Anchored Preference Optimization, as we found in the Open R1 project that doing multi-task RL on small models is quite a hassle to get right. With offline methods you shift much more effort into dataset curation / generation, but that still gives faster iteration cycles at the model scale we're dealing with!
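
For context on what such an offline preference step might look like: recent TRL versions expose APO-style losses through the DPO trainer (e.g. loss_type="apo_zero"). The sketch below assumes that API; the dataset and model names are placeholders and the hyperparameters are illustrative, not the values actually used.

    # Hedged sketch of Anchored Preference Optimization as an offline step, assuming a
    # TRL version whose DPOTrainer accepts APO losses via loss_type="apo_zero".
    # Dataset/model names and hyperparameters are placeholders, not the actual setup.
    from datasets import load_dataset
    from trl import DPOConfig, DPOTrainer

    # Offline preference pairs (prompt, chosen, rejected), generated and curated ahead
    # of time, so the curation effort replaces tuning an online RL loop.
    pairs = load_dataset("your-org/reasoning-preference-pairs", split="train")  # placeholder

    trainer = DPOTrainer(
        model="your-org/small-reasoner-sft",  # placeholder: the SFT checkpoint from the previous stage
        train_dataset=pairs,
        args=DPOConfig(output_dir="small-reasoner-apo", loss_type="apo_zero", beta=0.1),
    )
    trainer.train()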