Show HN: Steiner – An open-source reasoning model inspired by OpenAI o1

1. Mr_Bees69 ◴[22 Oct 24 16:15 UTC] No.41915821[source]▶

>>41915735 (OP) #

Really hope this goes somewhere, o1 without openai's costs and restrictions would be sweet.

replies(2): >>41916023 #>>41916629 #

2. peakji ◴[22 Oct 24 16:32 UTC] No.41916023[source]▶

>>41915821 (TP) #

The model can already answer some tricky questions that other models (including GPT-4o) have failed to address, achieving a +5.56 improvement on the GPQA-Diamond dataset. Unfortunately, it has not yet managed to reproduce inference-time scaling. I will continue to explore different approaches!

replies(1): >>41917314 #

3. ActorNightly ◴[22 Oct 24 17:35 UTC] No.41916629[source]▶

>>41915821 (TP) #

OpenAIs o1 isnt really going that far though. Its definitelly better in some areas, but not overall better.

Im wondering if we can abstract chain of thought further down into the computation levels to replace a lot of matrix multiply. Like smaller transformers with less parameters and more selection of which transformer to use through search.

4. swyx ◴[22 Oct 24 18:38 UTC] No.41917314[source]▶

>>41916023 #

not sure i understand the rsults. its based on qwen 32b which is 49.49, and your best model is 53.54. results havent shown that your approach adds significant value yet.

can you compare with just qwen 32b with CoT?

replies(1): >>41917409 #

5. peakji ◴[22 Oct 24 18:49 UTC] No.41917409{3}[source]▶

>>41917314 #

The result for Qwen2.5-32B (49.49) is using CoT prompting. Only Steiner models do not use CoT prompting.

More importantly, I highly recommend to try these out firsthand (not only Steiner, but all reasoning models). You'll find that these reasoning models can solve many problems that other models with the same parameter size cannot handle. The existing benchmarks may not reflect this well, as I mentioned in the article:

"... automated evaluation benchmarks, which are primarily composed of multiple-choice questions and may not fully reflect the capabilities of reasoning models. During the training phase, reasoning models are encouraged to engage in open-ended exploration of problems, whereas multiple-choice questions operate under the premise that "the correct answer must be among the options." This makes it evident that verifying options one by one is a more efficient approach. In fact, existing large language models have, consciously or unconsciously, mastered this technique, regardless of whether special prompts are used. Ultimately, it is this misalignment between automated evaluation and genuine reasoning requirements that makes me believe it is essential to open-source the model for real human evaluation and feedback."

replies(1): >>41922121 #

6. swyx ◴[23 Oct 24 05:35 UTC] No.41922121{4}[source]▶

>>41917409 #

thanks, congrats on shipping.