Reasoning is not model improvement

(manidoraisamy.com)

60 points QueensGambit | 3 comments | 23 Oct 25 15:39 UTC | HN request time: 0s | source

Show context

simonw ◴[23 Oct 25 20:19 UTC] No.45686545[source]▶

There are several key points in this that I don't think are accurate.

> When you ask o1 to multiply two large numbers, it doesn't calculate. It generates Python code, executes it in a sandbox, and returns the result.

That's not true of the model itself, see my comment here which demonstrates it multiplying two large numbers via the OpenAI API without using Python: https://news.ycombinator.com/item?id=45683113#45686295

On GPT-5 it says:

> What they delivered barely moved the needle on code generation, the one capability that everything else depends on.

I don't think that holds up. GPT-5 is wildly better at coding that GPT-4o was (and got even better with GPT-5-Codex). A lot of people have been ditching Claude for GPT-5 for coding stuff, and Anthropic held the throne for "best coding model" for well over a year prior to that.

From the conclusion:

> All [AI coding startups] betting on the same assumption: models will keep getting better at generating code. If that assumption is wrong, the entire market becomes a house of cards.

The models really don't need to get better at generating code right now for the economic impact to be profound. If progress froze today we could still spend the next 12+ months finding new ways to get better results for code out of our current batch of models.

replies(3): >>45686818 #>>45686948 #>>45687104 #

QueensGambit ◴[23 Oct 25 21:03 UTC] No.45687104[source]▶

>>45686545 #

OP here. Thanks for the thoughtful reply. Curious if you’ve measured o1’s accuracy and token cost with tool use enabled vs disabled? Wondering if python sandbox gives higher accuracy and lower cost, since internal reasoning chains are longer and pricier.

[Edit] Its probably premature to argue without the above data, but if we assume tool use gives ~100% accuracy and reasoning-only ~90%, then that 10% gap might represent the loss in the probabilistic model: either from functional ambiguity in the model itself or symbolic ambiguity from tokenization?

replies(1): >>45687228 #

1. simonw ◴[23 Oct 25 21:11 UTC] No.45687228[source]▶

>>45687104 #

I am 100% sure that using Python is faster and cheaper.

My o1 call in https://gist.github.com/simonw/a6438aabdca7eed3eec52ed7df64e... used 16 input tokens and produced 2357 output tokens (1664 were reasoning). At o1's price that's 14 cents! https://www.llm-prices.com/#it=16&ot=2357&ic=15&cic=7.5&oc=6...

I can't call o1 with the Python tool via the API, so I'll have to provide the price for the GPT-5 example in https://gist.github.com/simonw/c53c373fab2596c20942cfbb235af... - that one was 777 input tokens and 140 output tokens. Why 777 input tokens? That's a bit of a mystery to me - my assumption is that a bunch of extra system prompt stuff gets stuffed on describing that coding tool.

GPT-5 is hugely cheaper than o1 so that cost 0.22 cents (almost a quarter of a cent) - but if o1 ran with the same number of tokens it would only cost 1.94 cents: https://www.llm-prices.com/#it=777&ot=130&sel=gpt-5%2Co1-pre...

replies(2): >>45688317 #>>45695997 #

2. QueensGambit ◴[23 Oct 25 22:41 UTC] No.45688317[source]▶

>>45687228 (TP) #

wow, tool use seems to reduce total tokens by ~4 to 10× and cost by orders of magnitude. I wonder what the accuracy difference would be. I'm going to try multiplying larger and larger numbers to see how accuracy compares between tool use and pure reasoning.

3. QueensGambit ◴[24 Oct 25 16:05 UTC] No.45695997[source]▶

>>45687228 (TP) #

I’ve updated my article (replaced GPT-5 with ChatGPT-5 in this section):

When you ask the latest model, ChatGPT-5 to multiply two large numbers, it doesn't calculate. It generates Python code, executes it in a sandbox, and returns the result. Unlike ChatGPT-3, which at least attempted arithmetic internally (and often failed), ChatGPT-5 delegates computation to external tools. [1]

And added this note:

[1] There are 2 ways to multiply numbers in GPT-5:

- Python mode, which uses python sandbox as mentioned above

- No tool mode, which uses internal reasoning

Python mode is approximately 2x more accurate than no tool mode in FrontierMath (26.3% vs 13.5% accuracy on expert level math). Python mode is also 4x to 10x more cost effective than no tool mode. The GPT-5 API uses no-tool mode by default (tools must be explicitly enabled in API calls), while ChatGPT UI likely uses Python mode by default since Advanced Data Analysis is enabled by default for all Plus, Team, and Enterprise subscribers. This creates a significant cost optimization for OpenAI in the consumer product, while API users bear the full cost of inefficient reasoning unless they manually configure tool use.

---

Thanks again for flagging the inaccuracy, Simon! If you think any part of this update still misrepresents the model behavior, I’d love your input.

↑