
QueensGambit:
Hi HN, OP here. I'd appreciate feedback from folks with deep model knowledge on a few technical claims in the essay. I want to make sure I'm getting the fundamentals right.

1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?

2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?

3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward (a rough sketch of what I mean is below). For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?

Would love to know your thoughts!
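
To make point 3 a bit more concrete, here's a toy sketch of what I mean by attention constrained to a word-level graph. Purely illustrative - the masking scheme, shapes, and adjacency source are my own assumptions, not taken from any particular paper:

    # Toy single-head attention where each token may only attend to its
    # neighbours in a word-level graph (e.g. a dependency parse) plus itself.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def graph_attention(Q, K, V, adjacency):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (n, n) attention logits
        mask = adjacency | np.eye(len(adjacency), dtype=bool)
        scores = np.where(mask, scores, -1e9)            # block non-neighbours
        return softmax(scores) @ V

    n, d = 5, 8                                          # 5 tokens, hidden size 8
    rng = np.random.default_rng(0)
    Q, K, V = rng.standard_normal((3, n, d))
    adjacency = np.zeros((n, n), dtype=bool)             # edges would come from a parser
    adjacency[0, 1] = adjacency[1, 0] = True
    adjacency[1, 2] = adjacency[2, 1] = True
    print(graph_attention(Q, K, V, adjacency).shape)     # (5, 8)

In other words: ordinary scaled dot-product attention, but the softmax is restricted to graph neighbours instead of the full sequence - which is also roughly what a sparse-attention mask does.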

ACCount37:
Wrong on every count, basically.

1. You can enable or disable tool use in most APIs (a rough sketch of how to check this yourself is below, after this list). Generally, tools such as web search and a Python interpreter give models an edge. The same is true for humans, so, no surprise. At the frontier, model performance keeps climbing - both with tool use enabled and with it disabled.

2. Model capabilities keep improving. Frontier models of today are both more capable at their peak, and pack more punch for their weight, figuratively and literally. Capability per trained model weight and capability per unit of inference compute are both rising. This is reflected directly in model pricing - "GPT-4 level of performance" is getting cheaper over time.

3. We're 3 years into the AI revolution. If I had ten bucks for every "breakthrough new architecture idea" I've seen in the meantime, I'd be able to buy a full GB200 NVL72 with that.

As a rule: those "breakthroughs" aren't that. At best, they offer some incremental or area-specific improvements that could find their way into frontier models eventually. Think +4% performance across the board, or +30% to usable context length for the same amount of inference memory/compute, or a full generational leap but only in challenging image understanding tasks. There are some promising hybrid approaches, but none that do away with "autoregressive transformer with attention" altogether. So if you want a shiny new architecture to appear out of nowhere and bail you out of transformer woes? Prepare to be disappointed.
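
For what it's worth, point 1 is easy to check for yourself. A rough sketch (untested here; the model name is a placeholder and the calculator tool schema is my own invention - adjust both to whatever your provider actually supports):

    # Ask for a big multiplication with and without a calculator tool exposed,
    # then see whether the model answers directly or emits a tool call.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    PROMPT = "What is 834271 * 992837? Reply with the number only."
    calculator_tool = [{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate an arithmetic expression exactly.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }]

    for label, extra in (("tools disabled", {}),
                         ("tools enabled", {"tools": calculator_tool})):
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder; use whichever model you're testing
            messages=[{"role": "user", "content": PROMPT}],
            **extra,
        )
        msg = resp.choices[0].message
        # When the model offloads the math, tool_calls is populated instead of content.
        print(f"{label}: {msg.tool_calls or msg.content}")

    print("ground truth:", 834271 * 992837)

If the no-tools answer is wrong while the tools-enabled run emits a tool call, you have your answer for that particular model.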

throwthrowrow:
Question #1 was about the model's ability to handle arithmetic. The answer seems unrelated to that question, at least to me: "you can enable or disable tool use in most APIs".

The original question still stands: do recent LLMs have an inherent knowledge of arithmetic, or do they have to offload the calculation to some other non-LLM system?

Terr_:
Some nice charts here [0], which IMO show that LLMs are getting very good at guessing the answers to certain arithmetic operations, but that they don't actually perform the calculation in a logical fashion (a rough sketch of that kind of measurement is below).

[0] https://www.mindprison.cc/p/why-llms-dont-ask-for-calculator...
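
For anyone who wants to reproduce that kind of chart, a rough sketch of the measurement: exact-match accuracy of raw, no-tools multiplication as the operands get longer. ask_model is a hypothetical stand-in for whatever LLM call you use with tool use disabled; the prompt and trial count are arbitrary.

    # Measure exact-match accuracy of model multiplication vs. operand length.
    import random

    def ask_model(prompt: str) -> str:
        raise NotImplementedError("wire this to your LLM API with tools disabled")

    def accuracy_at(digits: int, trials: int = 20) -> float:
        hits = 0
        for _ in range(trials):
            a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            reply = ask_model(f"What is {a} * {b}? Reply with the number only.")
            guess = reply.strip().replace(",", "")   # tolerate thousands separators
            hits += (guess == str(a * b))            # exact match against the true product
        return hits / trials

    for d in (2, 4, 6, 8):
        print(f"{d}-digit operands: {accuracy_at(d):.0%} exact matches")

If accuracy falls off as the operands grow, that's the "good at guessing, not actually computing" pattern the charts describe.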