QueensGambit No.45683114
Hi HN, OP here. I'd appreciate feedback from folks with deep model knowledge on a few technical claims in the essay. I want to make sure I'm getting the fundamentals right.

1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?

2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?

3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?

Would love to know your thoughts!

simonw No.45686295
1 isn't true. o1 doesn't have access to a Python interpreter unless you explicitly grant it access.

If you call the OpenAI API for o1 and ask it to multiply two large numbers, it cannot use Python to help it.

Try this:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "o1",
        "input": "Multiply 87654321 × 98765432",
        "reasoning": {
          "effort": "medium",
          "summary": "detailed"
        }
      }'
Here's what I got back just now: https://gist.github.com/simonw/a6438aabdca7eed3eec52ed7df64e...

o1 correctly answered the multiplication by running a long multiplication process entirely through reasoning tokens.
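
For anyone who'd rather use the official Python SDK than raw curl, a rough equivalent is sketched below (not from the thread; it assumes the openai package's Responses API surface, i.e. client.responses.create and the output_text convenience property, matches the JSON request above). No tools are passed, so a correct answer has to come from the model's own reasoning:

    # Sketch: the same no-tools o1 request as the curl above, via the
    # official openai Python SDK. Assumes OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()

    response = client.responses.create(
        model="o1",
        input="Multiply 87654321 × 98765432",
        reasoning={"effort": "medium", "summary": "detailed"},
    )

    # No tools were provided, so the answer comes from reasoning tokens alone.
    print(response.output_text)

    # The product is easy to check locally:
    assert 87654321 * 98765432 == 8657216880231672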

simonw No.45686779
I know this isn't using tools (e.g. the Python interpreter) because you have to turn those on explicitly. That's not actually supported for o1 in the API, but you can do it for GPT-5 like this:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "gpt-5",
        "input": "Multiply 87654321 × 98765432",
        "reasoning": {
          "effort": "medium",
          "summary": "detailed"
        },
        "tools": [
          {
            "type": "code_interpreter",
            "container": {"type": "auto"}
          }
        ]
      }'
Here's the response: https://gist.github.com/simonw/c53c373fab2596c20942cfbb235af...

Note this bit where the code interpreter Python tool is called:

    {
      "id": "rs_080a5801ca14ad990068fa91f2779081a0ad166ee263153d98",
      "type": "reasoning",
      "summary": [
        {
          "type": "summary_text",
          "text": "**Calculating large product**\n\nI see that I need to compute the product of two large numbers, which involves big integer multiplication. It\u2019s a straightforward task, and given that I can use Python, that seems like the best route to avoid any potential errors. The user specifically asked for this multiplication, so I\u2019ll go ahead and use the python tool for accurate analysis. Let\u2019s get started on that!"
        }
      ]
    },
    {
      "id": "ci_080a5801ca14ad990068fa91f4dbe481a09eb646af049541c6",
      "type": "code_interpreter_call",
      "status": "completed",
      "code": "a = 87654321\r\nb = 98765432\r\na*b",
      "container_id": "cntr_68fa91f12f008191a359f1eeaed561290c438cc21b3fc083",
      "outputs": null
    }
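
If you want to reproduce the GPT-5-with-tools case from Python instead of curl, a minimal sketch is below (again assuming the SDK's client.responses.create mirrors the raw Responses API, and that items in response.output carry the same type and code fields as the JSON above):

    # Sketch: the GPT-5 request above with the code interpreter tool enabled,
    # via the openai Python SDK. Assumes OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()

    response = client.responses.create(
        model="gpt-5",
        input="Multiply 87654321 × 98765432",
        reasoning={"effort": "medium", "summary": "detailed"},
        tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    )

    # Walk the output items to see whether the model actually called the tool.
    for item in response.output:
        if item.type == "code_interpreter_call":
            print("code_interpreter_call:", item.code)

    print(response.output_text)
Whether the tool gets used is up to the model; in the run above it decided the Python route was the safest way to avoid arithmetic errors.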