60 points | QueensGambit | 6 comments
QueensGambit ◴[] No.45683114[source]
Hi HN, OP here. I'd appreciate feedback from folks with deep model knowledge on a few technical claims in the essay. I want to make sure I'm getting the fundamentals right.

1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?

2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?

3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?

Would love to know your thoughts!

replies(10): >>45686080 #>>45686164 #>>45686265 #>>45686295 #>>45686359 #>>45686379 #>>45686464 #>>45686479 #>>45686558 #>>45686559 #
1. ACCount37 ◴[] No.45686379[source]
Wrong on every count, basically.

1. You can enable or disable tool use in most APIs. Generally, tools such as web search and a Python interpreter give models an edge. The same is true for humans, so no surprise there. At the frontier, model performance keeps climbing - both with tool use enabled and with it disabled.
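Easy enough to probe yourself. A rough sketch against the OpenAI Chat Completions API (the model name and the "calculator" function are placeholders standing in for the hosted Python tool, and which models/endpoints accept tools varies): ask the same multiplication once with no tools on offer, so the model has to answer from its weights, and once with a tool available, then check whether it reaches for it.

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    question = {"role": "user",
                "content": "What is 783451 * 294817? Reply with the number only."}

    # 1) No tools offered: the model has to answer from its weights alone.
    no_tools = client.chat.completions.create(
        model="gpt-4o",  # placeholder - swap in whatever model you're evaluating
        messages=[question],
    )
    print("in-weights answer:", no_tools.choices[0].message.content)
    print("exact answer:     ", 783451 * 294817)

    # 2) A calculator tool offered: see whether the model chooses to call it.
    calculator = {
        "type": "function",
        "function": {
            "name": "calculator",  # hypothetical stand-in for a hosted interpreter
            "description": "Evaluate an arithmetic expression and return the result.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }
    with_tools = client.chat.completions.create(
        model="gpt-4o",
        messages=[question],
        tools=[calculator],
    )
    msg = with_tools.choices[0].message
    if msg.tool_calls:
        args = json.loads(msg.tool_calls[0].function.arguments)
        print("model chose to call the tool with:", args["expression"])
    else:
        print("model answered directly:", msg.content)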

2. Model capabilities keep improving. Frontier models of today are both more capable at their peak, and pack more punch for their weight, figuratively and literally. Capability per trained model weight and capability per unit of inference compute are both rising. This is reflected directly in model pricing - "GPT-4 level of performance" is getting cheaper over time.

3. We're 3 years into the AI revolution. If I had ten bucks for every "breakthrough new architecture idea" I've seen in the meantime, I could buy a full GB200 NVL72 with the proceeds.

As a rule: those "breakthroughs" aren't that. At best, they offer some incremental or area-specific improvements that could find their way into frontier models eventually. Think +4% performance across the board, or +30% to usable context length for the same amount of inference memory/compute, or a full generational leap but only in challenging image understanding tasks. There are some promising hybrid approaches, but none that do away with "autoregressive transformer with attention" altogether. So if you want a shiny new architecture to appear out of nowhere and bail you out of transformer woes? Prepare to be disappointed.

replies(2): >>45687282 #>>45695425 #
2. throwthrowrow ◴[] No.45687282[source]
Question #1 was about the model's ability to handle arithmetic. The answer seems unrelated to that question, at least to me: "you can enable or disable tool use in most APIs".

The original question still stands: do recent LLMs have an inherent knowledge of arithmetic, or do they have to offload the calculation to some other non-LLM system?

replies(2): >>45687577 #>>45688727 #
3. ACCount37 ◴[] No.45687577[source]
The knowledge was never the bottleneck for that, not since the days of GPT-3. The ability to execute on it was.

Which includes, among other things, the underappreciated metacognitive skill of "being able to decide when to do math quick and dirty, in one forward pass, and when to write it out explicitly and solve it step by step".
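To be concrete about what "write it out explicitly" means: it's the same schoolbook decomposition you'd do on paper, one easy partial product at a time. A toy illustration of that decomposition (this is the sort of thing a reasoning trace spells out in text, not anything the model literally executes):

    def long_multiply(a: int, b: int) -> int:
        """Schoolbook multiplication: one partial product per digit of b."""
        total = 0
        for place, digit_char in enumerate(reversed(str(b))):
            digit = int(digit_char)
            partial = a * digit * (10 ** place)  # one easy step at a time
            print(f"{a} * {digit} * 10^{place} = {partial}")
            total += partial
        print(f"sum of partial products = {total}")
        return total

    assert long_multiply(783451, 294817) == 783451 * 294817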

Today's frontier LLMs can do that. A lot of training for "reasoning" is just training for "execute on your knowledge reliably". They can usually solve math problems with no tool calls, but they will call a tool for more complex math when given the option.

4. Terr_ ◴[] No.45688727[source]
Some nice charts here [0], which IMO show that LLMs are getting very good at guessing the answers to certain arithmetic operations, but they don't actually carry the computation out in a logical fashion.

[0] https://www.mindprison.cc/p/why-llms-dont-ask-for-calculator...
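If you want to reproduce that kind of chart yourself, here's a rough sketch (placeholder model name, tiny sample sizes, and not the linked article's actual methodology): measure exact-match accuracy of no-tool multiplication as the operands get longer.

    import random
    from openai import OpenAI

    client = OpenAI()

    def raw_multiply(model: str, a: int, b: int) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Compute {a} * {b}. Reply with digits only."}],
        )
        return resp.choices[0].message.content.strip().replace(",", "")

    for digits in (2, 4, 6, 8):
        trials, hits = 20, 0
        for _ in range(trials):
            a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            if raw_multiply("gpt-4o", a, b) == str(a * b):  # placeholder model
                hits += 1
        print(f"{digits}-digit operands: {hits}/{trials} exact matches")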

5. vrighter ◴[] No.45695425[source]
3 years in? How long had you been hibernating before waking up 3 years ago?
replies(1): >>45696580 #
6. ACCount37 ◴[] No.45696580[source]
People in the industry started saying "oh shit this might be big" at a point between GPT-1 and GPT-2, but there were plenty of naysayers too. It only hit the mainstream with ChatGPT.

Which was also when the capabilities of LLMs became completely impossible to either ignore or excuse as "just matching seen data". But that was, in practice, solvable simply by increasing the copium intake.