
345 points kashifr | 2 comments
1. BarakWidawsky No.44502634
It’s interesting that it looks like they didn’t apply their own RL to the model, and instead fine-tuned on reasoning traces from existing large datasets and on reasoning traces generated by larger models
replies(1): >>44502761
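
The approach described above (supervised fine-tuning on distilled reasoning traces rather than RL) could look roughly like the following sketch using TRL's SFTTrainer. This is only an illustration: the dataset name and model checkpoint are hypothetical placeholders, not the actual training recipe.

    # Minimal sketch of SFT on distilled reasoning traces (not the actual recipe).
    # The dataset and checkpoint names below are placeholders.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Reasoning traces (prompt + chain-of-thought + final answer), either taken from
    # existing datasets or generated by a larger "teacher" model.
    traces = load_dataset("your-org/reasoning-traces", split="train")  # placeholder

    trainer = SFTTrainer(
        model="your-org/small-base-model",  # placeholder small base checkpoint
        train_dataset=traces,
        args=SFTConfig(output_dir="small-reasoner-sft"),
    )
    trainer.train()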
2. lewtun No.44502761
Indeed, we opted for offline methods like Anchored Preference Optimization, as we found in the Open R1 project that doing multi-task RL on small models is quite a hassle to get right. With offline methods you shift much more effort into dataset curation / generation, but that still gives faster iteration cycles at the model scale we're dealing with!
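
For context on what such an offline preference step might look like: recent TRL versions expose APO-style losses through the DPO trainer (e.g. loss_type="apo_zero"). The sketch below assumes that API; the dataset and model names are placeholders and the hyperparameters are illustrative, not the values actually used.

    # Hedged sketch of Anchored Preference Optimization as an offline step, assuming a
    # TRL version whose DPOTrainer accepts APO losses via loss_type="apo_zero".
    # Dataset/model names and hyperparameters are placeholders, not the actual setup.
    from datasets import load_dataset
    from trl import DPOConfig, DPOTrainer

    # Offline preference pairs (prompt, chosen, rejected), generated and curated ahead
    # of time, so the curation effort replaces tuning an online RL loop.
    pairs = load_dataset("your-org/reasoning-preference-pairs", split="train")  # placeholder

    trainer = DPOTrainer(
        model="your-org/small-reasoner-sft",  # placeholder: the SFT checkpoint from the previous stage
        train_dataset=pairs,
        args=DPOConfig(output_dir="small-reasoner-apo", loss_type="apo_zero", beta=0.1),
    )
    trainer.train()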