QueensGambit:
Hi HN, OP here. I'd appreciate feedback from folks with deep model knowledge on a few technical claims in the essay. I want to make sure I'm getting the fundamentals right.

1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full visibility into o1's internals. Is this accurate?

2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?

3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?

Would love to know your thoughts!

simonw:
Claim 1 isn't true. o1 doesn't have access to a Python interpreter unless you explicitly grant it access.

If you call the OpenAI API for o1 and ask it to multiply two large numbers, it cannot use Python to help it.

Try this:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "o1",
        "input": "Multiply 87654321 × 98765432",
        "reasoning": {
          "effort": "medium",
          "summary": "detailed"
        }
      }'
Here's what I got back just now: https://gist.github.com/simonw/a6438aabdca7eed3eec52ed7df64e...

o1 correctly answered the multiplication by running a long multiplication process entirely through reasoning tokens.

alganet:
I see this:

> "tool_choice": "auto"

> "parallel_tool_calls": true

Can you remake the API call, explicitly asking it not to perform any tool calls?

simonw:
I'm doing that here. It only makes tool calls if you give it a JSON list of tools it can call.

Those are its default settings whether or not there are tools configured. You can set tool_choice to the name of a specific tool in order to force it to use that tool.

I added my comment here to show an example of an API call with Python enabled: https://news.ycombinator.com/item?id=45686779

Update: Looks like you can add "tool_choice": "none" to prevent even tools you have configured from being called. https://platform.openai.com/docs/api-reference/responses/cre...
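
As a rough sketch, granting Python access means passing a tools list. This assumes the Responses API's hosted code_interpreter tool and a model that supports it; the exact schema may vary by API version:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "o1",
        "input": "Multiply 87654321 × 98765432",
        "tools": [
          {"type": "code_interpreter", "container": {"type": "auto"}}
        ]
      }'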

alganet:
There are three possible generic values for `tool_choice`: none, auto and required.

Can you remake the call explicitly using the value `none`?

Maybe it's not using Python but something else. I think it's a good test. If you're right, the response shouldn't change.

Update: `auto` is ambiguous. It doesn't say whether it's picking from your selection of tools or from the pool of all available tools. Explicit is better than implicit. I think you should do the call with `none`; it can't hurt, and it could prove me wrong.

simonw:
I just ran it like this and got the same correct result:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "o1",
        "input": "Multiply 87654321 × 98765432",
        "reasoning": {
          "effort": "medium",
          "summary": "detailed"
        },
        "tool_choice": "none"
      }'
Result: https://gist.github.com/simonw/52888b6546dcfc6a9dcc75bcf171b...

I promise you it is not using anything else. It is performing long multiplication entirely through model reasoning.

(I suggest getting your own OpenAI API key so you can try these things yourself.)
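
For what it's worth, the expected answer is easy to check locally; bash's 64-bit integer arithmetic is enough for a product of this size:

    $ echo $((87654321 * 98765432))
    8657216880231672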

alganet:
Looks legit.

I can see this call now has a lot more tokens for the reasoning steps. Maybe that's normal variance though.

(I don't have a particular interest in proving or disproving LLM claims, so there's no incentive for me to get a key.) There was an ambiguous point in the "proof", and I just highlighted it.

simonw:
If you want to write about LLMs, I really strongly recommend getting an API key for the major vendors. Being able to run quick experiments like this one is really useful if you want to develop a deeper understanding of what they can and cannot do.

You can also get an account with something like https://openrouter.ai/, which gives you one key to use with multiple different backends.

Or use GitHub Models, which gives you free, albeit limited, access to a bunch of models at once. https://github.com/marketplace/models
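
As a sketch of how simple these experiments are: OpenRouter exposes an OpenAI-compatible endpoint, so a quick test looks roughly like this (the model ID here is illustrative):

    curl https://openrouter.ai/api/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENROUTER_API_KEY" \
      -d '{
        "model": "openai/o1",
        "messages": [{"role": "user", "content": "Multiply 87654321 × 98765432"}]
      }'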

alganet:
I want to write about thinking critically, especially (but not only) in a software development context.

Lots of people don't have the resources to invest in LLMs (self-hosted or not). They rely on what other people say, and people get caught up in hype all the time. As it turns out, a lot of today's hype is around LLMs, so that's where I'll go.

I was skeptical about LK-99. I didn't have the resources to independently verify it. That doesn't mean I don't believe in superconductors, or that I should have no say in the matter.

Some of that hype will be justified, some will not. And that's exactly what I expect from this kind of technology.

simonw:
At this point most of the top-tier LLMs are available for free across most of the world. If people aren't experimenting with LLMs, it's not financial cost; it's either time constraints (figuring this stuff out does take a bunch of work) or finding the field uninteresting, downright scary, or both.
alganet:
I think you're missing something here.

I can invest lots of time in Linux, for example. I don't know how to write a driver for it, but I know I could learn how. If there's a bug in a driver, nothing stops me except my own will to learn. And I can do it on a potato, or on my phone.

I can experiment with free-tier LLMs, but that's as far as I will go. It's not just me: that's as far as 99% of developers will go.

So it's not uninteresting because it's boring or something. It's uninteresting because it puts a price on learning. That horizon of "if there's a bug in it, I can fix it" is severely limited. That's a price most free software developers don't consider worth paying. There are a lot of us.

simonw:
I don't understand, what am I missing?

I love learning about software. That's why I'm leaning so heavily on LLMs these days - they let me learn so much faster, and let me dig into whole new areas that previously I would never have considered experimenting with.

Just this week LLMs helped me figure out how to run Perl inside WebAssembly in a browser... and then how to compile 25-year-old C code to run in WebAssembly in the browser too. https://simonwillison.net/2025/Oct/22/sloccount-in-webassemb...

If I'd done this without LLMs I might have learned more of the underlying details... but realistically I wouldn't have done this at all, because my interest in Perl and C in WebAssembly is not strong enough to justify investing more than a few hours of effort.

alganet:
I would love to train an LLM from scratch to help me with some problems they're not good at, but I can't, because it costs thousands of dollars. You probably can't either, or only in a very limited capacity (agents, or maybe LoRA).

A while back, I didn't even know those problems existed. It took me a while to understand them, why they're interesting, and why lots of people spend time on them.

I have also tried adapting the problems to the LLMs, such as shaping a problem to look more like something they're already trained on, but I soon ran into the limitations of that approach.

I think in a couple of decades, maybe earlier, that kind of thing will be commonplace: people training their own models from scratch, on cheap hardware. It will unleash an even more rewarding learning experience for those willing to go the extra mile.

I think you're missing that perspective. That's fine, by the way. You're totally cool and probably helping lots of people with your work. I support it; it helps people understand where LLMs can currently help and where they cannot.

simonw:
There aren't many tasks these days for which training or fine-tuning a model seems necessary to me.

One of the reasons I am so excited about the "skills" concept from Anthropic is that it helps emphasize how the latest generation of LLMs really can pick up new capabilities if you get them to read a single, carefully constructed markdown file.
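
As a rough sketch of the format (the skill name and instructions here are hypothetical): a skill is a folder whose SKILL.md opens with YAML frontmatter telling the model when to load the rest of the file.

    ---
    name: long-multiplication
    description: Multiply large numbers by hand. Use when exact
      products of big integers are needed.
    ---

    # Long multiplication

    Split each factor into digits, compute the partial products,
    then sum the columns, carrying as needed.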

alganet:
I'm trying to simplify the live-bootstrap project by removing dependencies, reducing build time, and making it run more unattended (by automating the image creation steps, for example).

https://github.com/fosslinux/live-bootstrap/

Other efforts around the same problem are trying to make it more architecture-independent or to improve regenerations (re-building things like automake during the process).

It's free and open source, you're welcome to fork it and try your best with the aid of Claude. All you need is an x86 or x86-64 machine or qemu.

The project and other related repositories are already full of markdown documentation and well-commented code.

Here's a friendly primer on the problem:

https://www.youtube.com/watch?v=Fu3laL5VYdM

If you decide to help, please ask the maintainers if AI use is allowed beforehand. I'm OK with it, they might not be.