
60 points QueensGambit | 2 comments | source
QueensGambit ◴[] No.45683114[source]
Hi HN, OP here. I'd appreciate feedback from folks with deep model knowledge on a few technical claims in the essay. I want to make sure I'm getting the fundamentals right.

1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?
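For context on why a model would offload this: exact multiplication of large integers is a one-liner for a Python interpreter, while a transformer has to produce the digits token by token. A minimal sketch of the kind of snippet a tool-using model might emit (the operands are arbitrary illustrative values, not from the essay):

```python
# Exact big-integer multiplication -- the sort of one-liner a
# tool-using model can emit instead of computing digit-by-digit.
a = 73_164_298_817  # illustrative operand
b = 92_041_135_779  # illustrative operand
product = a * b  # Python ints are arbitrary-precision, so this is exact
print(product)
```

Since Python integers never overflow, the interpreter's answer is exact by construction, which is precisely the edge a tool call provides over in-weights arithmetic.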

2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?

3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?

Would love to know your thoughts!

replies(10): >>45686080 #>>45686164 #>>45686265 #>>45686295 #>>45686359 #>>45686379 #>>45686464 #>>45686479 #>>45686558 #>>45686559 #
ACCount37 ◴[] No.45686379[source]
Wrong on every count, basically.

1. You can enable or disable tool use in most APIs. Generally, tools such as web search and a Python interpreter give models an edge. The same is true for humans, so, no surprise. At the frontier, model performance keeps climbing - both with tool use enabled and with it disabled.
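Concretely, tool access is usually controlled per request. A hedged sketch of two Chat Completions-style request payloads, one that grants a code-execution tool and one that withholds it; the `run_python` function schema is an illustrative stand-in, not any vendor's built-in tool:

```python
# Two request payloads in the OpenAI Chat Completions style.
# The function schema below is a hypothetical example of a
# Python-execution tool; real deployments define their own.
question = {"role": "user", "content": "What is 73164298817 * 92041135779?"}

with_tools = {
    "model": "gpt-4o",  # placeholder model name
    "messages": [question],
    "tools": [{
        "type": "function",
        "function": {
            "name": "run_python",  # hypothetical tool name
            "description": "Execute a Python snippet and return stdout.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }],
}

# Omitting "tools" forces the model to answer from its weights alone --
# which is how you A/B raw capability against tool-assisted capability.
without_tools = {"model": "gpt-4o", "messages": [question]}
```

Running the same eval suite against both configurations is the standard way to separate model capability from orchestration gains.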

2. Model capabilities keep improving. Frontier models of today are both more capable at their peak, and pack more punch for their weight, figuratively and literally. Capability per trained model weight and capability per unit of inference compute are both rising. This is reflected directly in model pricing - "GPT-4 level of performance" is getting cheaper over time.
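The "capability per unit of inference compute" claim can be made concrete with a back-of-envelope calculation. All the scores and prices below are hypothetical placeholders, chosen only to show the shape of the trend, not real benchmark or pricing data:

```python
# Hypothetical (benchmark score, $ per million output tokens) pairs
# for three model generations at roughly "GPT-4 level". Figures are
# illustrative placeholders, not real numbers.
generations = {
    "gen_2023": {"score": 0.67, "usd_per_mtok": 60.0},
    "gen_2024": {"score": 0.69, "usd_per_mtok": 10.0},
    "gen_2025": {"score": 0.71, "usd_per_mtok": 2.0},
}

def score_per_dollar(entry):
    # Benchmark points bought per dollar of inference spend.
    return entry["score"] / entry["usd_per_mtok"]

for name, e in generations.items():
    print(name, round(score_per_dollar(e), 4))
```

Even with flat-looking peak scores, the cost-normalized metric climbs steeply, which is the "GPT-4 level getting cheaper" effect in numbers.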

3. We're 3 years into the AI revolution. If I had ten bucks for every "breakthrough new architecture idea" I've seen in the meantime, I'd be able to buy a full GB200 NVL72 with that.

As a rule: those "breakthroughs" aren't that. At best, they offer some incremental or area-specific improvements that could find their way into frontier models eventually. Think +4% performance across the board, or +30% to usable context length for the same amount of inference memory/compute, or a full generational leap but only in challenging image understanding tasks. There are some promising hybrid approaches, but none that do away with "autoregressive transformer with attention" altogether. So if you want a shiny new architecture to appear out of nowhere and bail you out of transformer woes? Prepare to be disappointed.
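On the "+30% usable context for the same inference memory" kind of gain: much of it comes from shrinking the KV cache, whose size follows a standard formula (2 tensors x layers x KV heads x head dim x sequence length x bytes per element). A sketch with a made-up 7B-class configuration; the parameters are illustrative, not any specific model's:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V each store one vector per layer, per KV head, per token,
    # hence the leading factor of 2. bytes_per_elem=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class config (not a specific real model).
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=32_768)

# Grouped-query attention with 8 KV heads instead of 32: a 4x smaller
# cache, i.e. 4x the context fits in the same memory budget.
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32_768)

print(full / 2**30, "GiB vs", gqa / 2**30, "GiB")
```

This is the flavor of "incremental improvement that finds its way into frontier models": it changes the memory arithmetic, not the underlying autoregressive-transformer recipe.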

replies(2): >>45687282 #>>45695425 #
1. vrighter ◴[] No.45695425[source]
3 years in? How long had you been hibernating before that?
replies(1): >>45696580 #
2. ACCount37 ◴[] No.45696580[source]
People in the industry started saying "oh shit this might be big" at a point between GPT-1 and GPT-2, but there were plenty of naysayers too. It only hit the mainstream with ChatGPT.

Which was also when the capabilities of LLMs became completely impossible to either ignore or excuse as "just matching seen data". But that was, in practice, solvable simply by increasing the copium intake.