QueensGambit ◴[] No.45683114[source]
Hi HN, OP here. I'd appreciate feedback from folks with deep model knowledge on a few technical claims in the essay. I want to make sure I'm getting the fundamentals right.

1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?

2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?

3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?

Would love to know your thoughts!

replies(10): >>45686080 #>>45686164 #>>45686265 #>>45686295 #>>45686359 #>>45686379 #>>45686464 #>>45686479 #>>45686558 #>>45686559 #
simonw ◴[] No.45686295[source]
1 isn't true. o1 doesn't have access to a Python interpreter unless you explicitly grant it access.

If you call the OpenAI API for o1 and ask it to multiply two large numbers it cannot use Python to help it.

Try this:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "o1",
        "input": "Multiply 87654321 × 98765432",
        "reasoning": {
          "effort": "medium",
          "summary": "detailed"
        }
      }'
Here's what I got back just now: https://gist.github.com/simonw/a6438aabdca7eed3eec52ed7df64e...

o1 correctly answered the multiplication by running a long multiplication process entirely through reasoning tokens.
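
You can sanity-check the result with any arbitrary-precision calculator; bc does exact integer math, for example:

    # exact product, computed conventionally
    echo "87654321 * 98765432" | bc
    # 8657216880231672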

replies(4): >>45686406 #>>45686717 #>>45686779 #>>45687000 #
alganet ◴[] No.45686717[source]
I see this:

> "tool_choice": "auto"

> "parallel_tool_calls": true

Can you remake the API call explicitly asking it to not perform any tool calls?

replies(1): >>45686747 #
simonw ◴[] No.45686747[source]
I'm doing that here. It only makes tool calls if you give it a JSON list of tools it can call.

Those are its default settings whether or not there are tools configured. You can set tool_choice to the name of a specific tool in order to force it to use that tool.

I added my comment here to show an example of an API call with Python enabled: https://news.ycombinator.com/item?id=45686779

Update: Looks like you can add "tool_choice": "none" to prevent even tools you have configured from being called. https://platform.openai.com/docs/api-reference/responses/cre...
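
For comparison, here's roughly what a tools-enabled call looks like; the code_interpreter shape below is my reading of the Responses API docs (treat it as a sketch), and adding "tool_choice": "none" should stop even that configured tool from being called:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "o1",
        "input": "Multiply 87654321 × 98765432",
        "tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
        "tool_choice": "none"
      }'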

replies(1): >>45686837 #
alganet ◴[] No.45686837[source]
There are three possible generic values for `tool_choice`: `none`, `auto`, and `required`.

Can you remake the call explicitly using the value `none`?

Maybe it's not using Python but some other tool. I think it's a good test: if you're right, the response shouldn't change.

Update: `auto` is ambiguous. It doesn't say whether the model is picking from your selection of tools or from the pool of all available tools. Explicit is better than implicit. I think you should do the call with `none`; it can't hurt, and it could prove me wrong.

replies(1): >>45686962 #
simonw ◴[] No.45686962[source]
I just ran it like this and got the same correct result:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "o1",
        "input": "Multiply 87654321 × 98765432",
        "reasoning": {
          "effort": "medium",
          "summary": "detailed"
        },
        "tool_choice": "none"
      }'
Result: https://gist.github.com/simonw/52888b6546dcfc6a9dcc75bcf171b...

I promise you it is not using anything else. It is performing long multiplication entirely through model reasoning.

(I suggest getting your own OpenAI API key so you can try these things yourself.)

replies(2): >>45687099 #>>45687291 #
simonw ◴[] No.45687099[source]
And if you're still not convinced, the definitive evidence is to run a reasoning model on your own machine.

OpenAI's gpt-oss-20b is a 12GB download for LM Studio from https://lmstudio.ai/models/openai/gpt-oss-20b
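
If you'd rather drive it from the command line than the GUI, LM Studio can also serve the model over a local OpenAI-compatible endpoint. Assuming its default port (1234) and that it registers the model under this name, something like:

    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Multiply 87654321 × 98765432"}]
      }'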

It turns out it's powerful enough to solve this. Here's the thinking trace:

    We need product. Compute. Use approximate? Let's compute exactly. 87,654,321 * 98,765,432.

    We can use multiplication: Let's do 87,654,321 * 98,765,432.

    Compute using decomposition: 98,765,432 = 100,000,000 -1,234,568.

    So product = 87,654,321*100,000,000 - 87,654,321*1,234,568.

    First term: 8,765,432,100,000,000? Wait multiply: 87,654,321 *100,000,000 = 8,765,432,100,000,000.

    Now compute second term: 87,654,321 *1,234,568.

    Compute 1,234,568*87,654,321. Use maybe multiply 87,654,321 * 1,234,568 = 87,654,321*(1,200,000+34,568).

    First: *1,200,000 = 87,654,321*12*100,000. 87,654,321*12=1,051,851,852? Wait compute: 87,654,321*10=876,543,210; *2=175,308,642; sum =1,051,851,852. Then times100,000 =>105,185,185,200,000.

    Second: *34,568. Compute 87,654,321*34,568 = 87,654,321*(30,000+4,568).

    *30,000: 87,654,321*3=262,962,963; times10,000 =>2,629,629,630,000.

    *4,568: compute 87,654,321*4,568 = *4000 + *500 + *60 + *8.

    *4000: 87,654,321*4=350,617,284; times1000 =>350,617,284,000.

    *500: *5=438,271,605; times100 =>43,827,160,500.

    *60: *6=525,925,926; times10 =>5,259,259,260.

    *8: 700,834,568? Wait *8=700,834,568.

    Sum these: 350,617,284,000
    +43,827,160,500 =394,444,444,500
    +5,259,259,260 =399,703,703,760
    +700,834,568 =400,404,538,328.

    Now add *30,000 part: 2,629,629,630,000 +400,404,538,328 =3,030,034,168,328.

    Now add to first part: 105,185,185,200,000 +3,030,034,168,328 =108,215,219,368,328.

    Thus product second term =108,215,219,368,328.

    Now subtract from first term: 8,765,432,100,000,000 -108,215,219,368,328 = 8,657,216,880,631,672.

    So answer: 8,657,216,880,631,672.
And a screenshot: https://gist.github.com/simonw/a8929c0df5f204981652871555420...
replies(1): >>45687769 #
photonthug ◴[] No.45687769[source]
Thanks for doing this. OpenAI is not in fact open, so referencing their claims as obviously true, on this or anything else, is just a non-starter. As a counterpoint, though: it's been a while since I've run this kind of experiment locally, so I started one too. The only reasoning model I have is qwen3:latest, and I won't clutter the thread with the output, but it's complete junk.

To summarize: with large numbers it goes nuts trying to find a trick or shortcut. After I cut off dead ends in several trials, it always eventually considers long-form addition, then ultimately rejects it as "tedious" and starts looking for "patterns": "Wait, let me use the standard multiplication algorithm step by step... oh, that's a lot of steps... break it down into parts... let me think." It spent ~45 minutes thinking (I'm on CPU), but it basically cannot follow one strategy long enough to complete the work, even when it lands on a sensible approach.

For multiplying two-digit numbers, it does better. It starts using the "manual way", messes up certain steps, then gets the right answer for those sub-problems anyway, because obviously those are memoized somewhere. But at least once, it got the correct answer with the correct approach.

I think this raises the question: if you were to double the size of your input numbers and let the more powerful local model answer, could it still perform the process? Does that stop working for any reason at some point before the context window overflows?
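
For anyone who wants to run that test: here's a minimal version against a local Ollama server (default port 11434, the qwen3:latest tag from above), with bc as the exact ground truth; keep doubling the digit count until the model falls over:

    # exact answer for the doubled-size case (bc is arbitrary precision)
    echo "8765432187654321 * 9876543298765432" | bc

    # same question to the local model; compare its answer by eye
    curl -s http://localhost:11434/api/generate \
      -d '{"model": "qwen3:latest", "prompt": "Multiply 8765432187654321 * 9876543298765432", "stream": false}'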