60 points by QueensGambit | 48 comments
1. QueensGambit ◴[] No.45683114[source]
Hi HN, OP here. I'd appreciate feedback from folks with deep model knowledge on a few technical claims in the essay. I want to make sure I'm getting the fundamentals right.

1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?

2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?

3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?

Would love to know your thoughts!

replies(10): >>45686080 #>>45686164 #>>45686265 #>>45686295 #>>45686359 #>>45686379 #>>45686464 #>>45686479 #>>45686558 #>>45686559 #
2. Workaccount2 ◴[] No.45686080[source]
I don't think HN is a place for fact checking your blog.
replies(1): >>45686167 #
3. lawlessone ◴[] No.45686164[source]
All I can say, as someone who's sorta a layperson, is that math isn't an LLM's strength. Having them defer calculations to calculators or Python code seems better than having them guess that 1+1 = 2 because their past data says 1+1 = 2.
4. Terr_ ◴[] No.45686167[source]
On the other hand, "that's technically wrong" is an extraordinarily popular staple of online geek discussion. :p
5. cpa ◴[] No.45686265[source]
I don't think 2 is true: when OpenAI's model won a gold medal at the math olympiad, it did so without tools or web search, just pure inference. Such a feat definitely would not have happened with o1.
replies(2): >>45686389 #>>45686475 #
6. simonw ◴[] No.45686295[source]
1 isn't true. o1 doesn't have access to a Python interpreter unless you explicitly grant it access.

If you call the OpenAI API for o1 and ask it to multiply two large numbers it cannot use Python to help it.

Try this:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "o1",
        "input": "Multiply 87654321 × 98765432",
        "reasoning": {
          "effort": "medium",
          "summary": "detailed"
        }
      }'
Here's what I got back just now: https://gist.github.com/simonw/a6438aabdca7eed3eec52ed7df64e...

o1 correctly answered the multiplication by running a long multiplication process entirely through reasoning tokens.

replies(4): >>45686406 #>>45686717 #>>45686779 #>>45687000 #
7. anonymoushn ◴[] No.45686359[source]
I don't really know what you mean by "preserve semantic meaning at the word level." The significant misunderstanding about tokenization present elsewhere in the article is concerning, given that the proposed path forward is to do with replacing tokenization somehow.
replies(1): >>45687203 #
8. ACCount37 ◴[] No.45686379[source]
Wrong on every count, basically.

1. You can enable or disable tool use in most APIs. Generally, tools such as web search and Python interpreter give models an edge. The same is true for humans, so, no surprise. At the frontier, model performance keeps climbing - both with tool use enabled and with it disabled.

2. Model capabilities keep improving. Frontier models of today are both more capable at their peak, and pack more punch for their weight, figuratively and literally. Capability per trained model weight and capability per unit of inference compute are both rising. This is reflected directly in model pricing - "GPT-4 level of performance" is getting cheaper over time.

3. We're 3 years into the AI revolution. If I had ten bucks for every "breakthrough new architecture idea" I've seen in the meantime, I'd be able to buy a full GB200 NVL72 with that money.

As a rule: those "breakthroughs" aren't that. At best, they offer some incremental or area-specific improvements that could find their way into frontier models eventually. Think +4% performance across the board, or +30% to usable context length for the same amount of inference memory/compute, or a full generational leap but only in challenging image understanding tasks. There are some promising hybrid approaches, but none that do away with "autoregressive transformer with attention" altogether. So if you want a shiny new architecture to appear out of nowhere and bail you out of transformer woes? Prepare to be disappointed.

replies(2): >>45687282 #>>45695425 #
9. MoltenMan ◴[] No.45686389[source]
True, but aren't the math (and competitive programming) achievements a bit different? They're specific models heavily RL'd on competition math problems. Obviously still ridiculously impressive, but if you haven't done competition math or programming before it's much more memorization of techniques than you might expect and it's much easier to RL on.
10. ◴[] No.45686406[source]
11. XenophileJKO ◴[] No.45686464[source]
Point 2 is 1000% not true: the models have gotten better at the overall act of coding, and they have also gotten WAY better at USING tools. This isn't tool orchestration frameworks - this is knowing how and when to use tools effectively, and it lives largely inside the model. I would also say this is a fundamental model capability.

This improved think->act->sense loop that they now form exponentially increases the possible utility of the models. We are just starting to see this with GPT-5 and the 4+ series of Claude models.

replies(1): >>45686805 #
12. simonw ◴[] No.45686475[source]
Yeah, I confirmed this at the time. Neither OpenAI nor Gemini used tools as part of their IMO gold medal performances.

Here's OpenAI's tweet about this: https://twitter.com/SebastienBubeck/status/19465776504050567...

> Just to spell it out as clearly as possible: a next-word prediction machine (because that's really what it is here, no tools no nothing) just produced genuinely creative proofs for hard, novel math problems at a level reached only by an elite handful of pre‑college prodigies.

My notes: https://simonwillison.net/2025/Jul/19/openai-gold-medal-math...

They DID use tools for the International Collegiate Programming Contest (ICPC) programming one though: https://twitter.com/ahelkky/status/1971652614950736194

> For OpenAI, the models had access to a code execution sandbox, so they could compile and test out their solutions. That was it though; no internet access.

replies(1): >>45686838 #
13. mirekrusin ◴[] No.45686479[source]
A reasoning model doesn't imply tool calling – those shouldn't be conflated.

Reasoning just means more chain-of-thought, done implicitly. It can be emulated with a non-reasoning model by explicitly constructing a prompt that walks it through a longer step-by-step thought process. With reasoning models it just happens implicitly; some models allow control over reasoning effort with special tokens. Those models are simply fine-tuned to do it themselves without explicit dialogue from the user.

Tool calling happens primarily on the client side. The research/web-access modes etc. that some providers make available (built on tool calling they handle themselves) are not a property of the model - they can be enabled for any model.
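
For example, declaring a client-side tool against the Responses API looks roughly like this (just a sketch - the `multiply` tool is made up, and the exact schema fields may differ slightly from the current docs). The model can only respond with a function-call item naming that tool; it's your own code that has to execute it and send the result back:

    # declare a client-side "multiply" tool; the model can ask for it but cannot run it itself
    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "gpt-5",
        "input": "Multiply 87654321 × 98765432",
        "tools": [
          {
            "type": "function",
            "name": "multiply",
            "description": "Multiply two integers exactly",
            "parameters": {
              "type": "object",
              "properties": {
                "a": {"type": "integer"},
                "b": {"type": "integer"}
              },
              "required": ["a", "b"]
            }
          }
        ]
      }'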

Nothing has plateaued from where I'm standing – new models are being trained, releases happen frequently and get integrated impressively fast. New models outperform previous ones. Models keep gaining multimodality, etc.

Regarding alternative architectures – new ones are proposed all the time. It's not easy to verify all of them at scale. Some ideas that extend the current state-of-the-art architectures do end up in frontier models, but training takes time, so a lag exists. There are also a lot of improvements that commercial companies keep hidden from the public.

14. Legend2440 ◴[] No.45686558[source]
>I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?

Both reasoning and non-reasoning models may choose to use the Python interpreter to solve math problems. This isn't hidden from the user; it will show the interpreter ("Analyzing...") and you can click on it to see the code it ran.

It can also solve math problems by working through them step-by-step. In this case it will do long multiplication using the pencil-and-paper method, and it will show its work.

15. mxkopy ◴[] No.45686559[source]
Not affiliated with anyone, but I think the likes of OptNet (differentiable constraint optimization) are soon going to play a role in developing AI with precise deductive reasoning.

More broadly, I think what we're looking for at the end of the day, AGI, is going to come about from a diaspora of methods capturing the diverse aspects of what we recognize as intelligence. 'Precise deductive reasoning' is one capability out of many. Attention isn't all you need; neither is compression, convex programming, or what have you. The perceived "smoothness" or "unity" of our intelligence is an illusion, like virtual memory hiding the cache, and building it is going to look a lot more like stitching these capabilities together than deriving some deep and elegant equation.

16. alganet ◴[] No.45686717[source]
I see this:

> "tool_choice": "auto"

> "parallel_tool_calls": true

Can you remake the API call explicitly asking it to not perform any tool calls?

replies(1): >>45686747 #
17. simonw ◴[] No.45686747{3}[source]
I'm doing that here. It only makes tool calls if you give it a JSON list of tools it can call.

Those are its default settings whether or not there are tools configured. You can set tool_choice to the name of a specific tool in order to force it to use that tool.

I added my comment here to show an example of an API call with Python enabled: https://news.ycombinator.com/item?id=45686779

Update: Looks like you can add "tool_choice": "none" to prevent even tools you have configured from being called. https://platform.openai.com/docs/api-reference/responses/cre...

replies(1): >>45686837 #
18. simonw ◴[] No.45686779[source]
I know this isn't using tools (e.g. the Python interpreter) because you have to turn those on explicitly. That's not actually supported for o1 in the API but you can do it for GPT-5 like this:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "gpt-5",
        "input": "Multiply 87654321 × 98765432",
        "reasoning": {
          "effort": "medium",
          "summary": "detailed"
        },
        "tools": [
          {
            "type": "code_interpreter",
            "container": {"type": "auto"}
          }
        ]
      }'
Here's the response: https://gist.github.com/simonw/c53c373fab2596c20942cfbb235af...

Note this bit where the code interpreter Python tool is called:

    {
      "id": "rs_080a5801ca14ad990068fa91f2779081a0ad166ee263153d98",
      "type": "reasoning",
      "summary": [
        {
          "type": "summary_text",
          "text": "**Calculating large product**\n\nI see that I need to compute the product of two large numbers, which involves big integer multiplication. It\u2019s a straightforward task, and given that I can use Python, that seems like the best route to avoid any potential errors. The user specifically asked for this multiplication, so I\u2019ll go ahead and use the python tool for accurate analysis. Let\u2019s get started on that!"
        }
      ]
    },
    {
      "id": "ci_080a5801ca14ad990068fa91f4dbe481a09eb646af049541c6",
      "type": "code_interpreter_call",
      "status": "completed",
      "code": "a = 87654321\r\nb = 98765432\r\na*b",
      "container_id": "cntr_68fa91f12f008191a359f1eeaed561290c438cc21b3fc083",
      "outputs": null
    }
19. emp17344 ◴[] No.45686805[source]
Yes, the models have gotten better at using tools because tech companies have poured an insane amount of money into improving tools and integrating them with LLMs. Is this because the models have actually improved, or because the tools and integration methods have improved? I don’t think anyone actually knows.
replies(2): >>45687139 #>>45687155 #
20. alganet ◴[] No.45686837{4}[source]
There are three possible generic values for `tool_choice`: none, auto and required.

Can you remake the call explicitly using the value `none`?

Maybe it's not using Python but something else. I think it's a good test. If you're right, the response shouldn't change.

Update: `auto` is ambiguous. It doesn't say whether it's picking from your selection of tools or from the pool of all available tools. Explicit is better than implicit. I think you should make the call with `none` - it can't hurt, and it could prove me wrong.

replies(1): >>45686962 #
21. emp17344 ◴[] No.45686838{3}[source]
We still have next to no real information on how the models achieved the gold medal. It’s a little early to be confirming anything, especially when the main source is a Twitter thread initiated by a company known for “exaggerating” the truth.
replies(1): >>45687260 #
22. simonw ◴[] No.45686962{5}[source]
I just ran it like this and got the same correct result:

    curl https://api.openai.com/v1/responses \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -d '{
        "model": "o1",
        "input": "Multiply 87654321 × 98765432",
        "reasoning": {
          "effort": "medium",
          "summary": "detailed"
        },
        "tool_choice": "none"
      }'
Result: https://gist.github.com/simonw/52888b6546dcfc6a9dcc75bcf171b...

I promise you it is not using anything else. It is performing long multiplication entirely through model reasoning.

(I suggest getting your own OpenAI API key so you can try these things yourself.)

replies(2): >>45687099 #>>45687291 #
23. ◴[] No.45687000[source]
24. simonw ◴[] No.45687099{6}[source]
And if you're still not convinced the definitive evidence is to run a reasoning model on your own machine.

OpenAI's gpt-oss-20b is a 12GB download for LM Studio from https://lmstudio.ai/models/openai/gpt-oss-20b

It turns out it's powerful enough to solve this. Here's the thinking trace:

    We need product. Compute. Use approximate? Let's compute exactly. 87,654,321 * 98,765,432.

    We can use multiplication: Let's do 87,654,321 * 98,765,432.

    Compute using decomposition: 98,765,432 = 100,000,000 -1,234,568.

    So product = 87,654,321*100,000,000 - 87,654,321*1,234,568.

    First term: 8,765,432,100,000,000? Wait multiply: 87,654,321 *100,000,000 = 8,765,432,100,000,000.

    Now compute second term: 87,654,321 *1,234,568.

    Compute 1,234,568*87,654,321. Use maybe multiply 87,654,321 * 1,234,568 = 87,654,321*(1,200,000+34,568).

    First: *1,200,000 = 87,654,321*12*100,000. 87,654,321*12=1,051,851,852? Wait compute: 87,654,321*10=876,543,210; *2=175,308,642; sum =1,051,851,852. Then times100,000 =>105,185,185,200,000.

    Second: *34,568. Compute 87,654,321*34,568 = 87,654,321*(30,000+4,568).

    *30,000: 87,654,321*3=262,962,963; times10,000 =>2,629,629,630,000.

    *4,568: compute 87,654,321*4,568 = *4000 + *500 + *60 + *8.

    *4000: 87,654,321*4=350,617,284; times1000 =>350,617,284,000.

    *500: *5=438,271,605; times100 =>43,827,160,500.

    *60: *6=525,925,926; times10 =>5,259,259,260.

    *8: 700,834,568? Wait *8=700,834,568.

    Sum these: 350,617,284,000
    +43,827,160,500 =394,444,444,500
    +5,259,259,260 =399,703,703,760
    +700,834,568 =400,404,538,328.

    Now add *30,000 part: 2,629,629,630,000 +400,404,538,328 =3,030,034,168,328.

    Now add to first part: 105,185,185,200,000 +3,030,034,168,328 =108,215,219,368,328.

    Thus product second term =108,215,219,368,328.

    Now subtract from first term: 8,765,432,100,000,000 -108,215,219,368,328 = 8,657,216,880,631,672.

    So answer: 8,657,216,880,631,672.
And a screenshot: https://gist.github.com/simonw/a8929c0df5f204981652871555420...
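
If you'd rather poke at it from the command line than the LM Studio UI, LM Studio can expose a local OpenAI-compatible server (port 1234 by default, as far as I recall). A rough sketch - the model identifier here is a guess based on LM Studio's naming, so adjust it to whatever your install calls the download:

    # assumes LM Studio's local server is running with gpt-oss-20b loaded
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [
          {"role": "user", "content": "Multiply 87654321 × 98765432"}
        ]
      }'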
replies(1): >>45687769 #
25. XenophileJKO ◴[] No.45687139{3}[source]
The models have improved. They are using "arbitrary tools" better.
replies(1): >>45687549 #
26. remich ◴[] No.45687155{3}[source]
But, isn't improving tools and the LLM's integration with them improving the model?

Caveat that we don't fully understand how human intelligence works, but with humans it's generally true that skills are not static or siloed. Improving in one area can generate dividends in others. It's like how some professional football players improve their games by taking ballet lessons. Two very different skills, but the incorporation of one improves the other as well as the whole.

I would argue that narrowly focusing on LLM performance via benchmarks before tool use is incorporated is interesting, but not particularly relevant to whether they are transformative, or even useful, as products.

27. remich ◴[] No.45687203[source]
Right - words don't have semantic meaning on their own; that meaning is derived from surrounding context. "Cat" is both an animal and a bash command.
28. simonw ◴[] No.45687260{4}[source]
If you're not going to believe researchers when they tell you how they did something then sure, we don't know how they did it.

Given how much bad press OpenAI got just last week[1] when one of their execs clumsily (and, I would argue, misleadingly) described a model achievement and then had to walk it back amid widespread headlines about their dishonesty, those researchers have a VERY strong incentive to tell the truth.

[1] https://techcrunch.com/2025/10/19/openais-embarrassing-math/

replies(1): >>45687774 #
29. throwthrowrow ◴[] No.45687282[source]
Question #1 was about the model's ability to handle arithmetic. The answer to that question seems unrelated, at least to me: "you can enable or disable tool use in most APIs".

The original question still stands: do recent LLMs have an inherent knowledge of arithmetic, or do they have to offload the calculation to some other non-LLM system?

replies(2): >>45687577 #>>45688727 #
30. alganet ◴[] No.45687291{6}[source]
Looks legit.

I can see this call now has a lot more tokens for the reasoning steps. Maybe that's normal variance though.

(I don't have a particular interest in proving or disproving LLM things, so there's no incentive for me to get a key.) There was an ambiguous point in the "proof"; I just highlighted it.

replies(1): >>45687320 #
31. simonw ◴[] No.45687320{7}[source]
If you want to write about LLMs I really strongly recommend getting an API key for the major vendors! It's really useful being able to run quick experiments like this one if you want to develop a deeper understanding of what they can and cannot do.

You can also get an account with something like https://openrouter.ai/ which gives you one key to use with multiple different backends.

Or use GitHub Models which gives you free albeit limited access to a bunch at once. https://github.com/marketplace/models
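
As a rough sketch of how that works (the model slug is just an illustration - OpenRouter uses "provider/model" names), a call has the same shape as the OpenAI one:

    # OpenRouter speaks the OpenAI chat completions format, one key across many providers
    curl https://openrouter.ai/api/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $OPENROUTER_API_KEY" \
      -d '{
        "model": "openai/gpt-5",
        "messages": [
          {"role": "user", "content": "Multiply 87654321 × 98765432"}
        ]
      }'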

replies(1): >>45687514 #
32. alganet ◴[] No.45687514{8}[source]
I want to write about thinking critically, especially but not limited to a software development context.

Lots of people don't have resources to invest in LLMs (either self-hosted or not). They rely on what other people say. And people get caught in the hype all the time. As it turns out, lots of hype nowadays is around LLMs, so that's where I'll go.

I was skeptical about LK-99. I didn't have the resources to independently verify it. That doesn't mean I don't believe in superconductors or that I should have no say in it.

Some of that hype will be justified, some will not. And that's exactly what I expect from this kind of technology.

replies(1): >>45687554 #
33. emp17344 ◴[] No.45687549{4}[source]
I don’t know what you mean, because arbitrary tools don’t integrate with LLMs in the first place. Are you referring to MCP?
34. simonw ◴[] No.45687554{9}[source]
At this point most of the top-tier LLMs are available for free across most of the world. If people aren't experimenting with LLMs, it's not due to financial cost; it's due either to time constraints (figuring this stuff out does take a bunch of work) or to finding the field uninteresting or downright scary (or both).
replies(1): >>45687809 #
35. ACCount37 ◴[] No.45687577{3}[source]
The knowledge was never the bottleneck for that, not since the days of GPT-3. The ability to execute on it was.

Which includes, among other things, the underappreciated metacognitive skill of "being able to decide when to do math quick and dirty, in one forward pass, and when to write it out explicitly and solve it step by step".

Today's frontier LLMs can do that. A lot of training for "reasoning" is just training for "execute on your knowledge reliably". They usually can solve math problems with no tool calls. But they will tool call for more complex math when given an option to.

36. photonthug ◴[] No.45687769{7}[source]
Thanks for doing this. OpenAI is not in fact open, so referencing their claims as obviously true on anything else is just a non-starter. Counterpoint though, it's been a while since I've run this kind of experiment locally, so I started one too. For reasoning I only have qwen3:latest and I won't clutter the thread with the output, but it's complete junk.

To summarize: with large numbers it goes nuts trying to find a trick or shortcut. After I cut off dead ends in several trials, it always eventually considers long-form addition, then ultimately rejects it as "tedious" and starts looking for "patterns". Wait, let me use the standard multiplication algorithm step by step; oh, that's a lot of steps; break it down into parts. Let me think. It went on like that over ~45 minutes of thinking (I'm on CPU), but it basically cannot follow one strategy long enough to complete the work, even when it lands on a sensible approach.

For multiplying two-digit numbers, it does better. Starts using the "manual way", messes up certain steps, then gets the right answer for sub-problems anyway because obviously those are memoized somewhere. But at least once, it got the correct answer with the correct approach.

I think this raises the question, if you were to double the size of your input numbers and let the more powerful local model answer, could it still perform the process? Does that stop working for any reason at some point before the context window overflows?

37. emp17344 ◴[] No.45687774{5}[source]
Any company will apologize when they receive bad press. That’s basic corporate PR, not integrity.
replies(1): >>45687806 #
38. simonw ◴[] No.45687806{6}[source]
It illustrates that there is a real risk to lying about research results: if you get caught it's embarrassing.

It's also worth taking professional integrity into account. Even if OpenAI's culture didn't value the truth, individual researchers still care about being honest.

replies(1): >>45687880 #
39. alganet ◴[] No.45687809{10}[source]
I think you're missing something here.

I can invest lots of time in Linux, for example. I don't know how to write a driver for it, but I know I could learn how to do it. If there's a bug in a driver, there's nothing stopping me except my own will to learn. I can also do it on a potato, or on my phone.

I can experiment with free tier LLMs, but that's as far as I will go. It's not just about me, that is as far as 99% of the developers will go.

So it's not uninteresting because it's boring or something. It's uninteresting because it puts a price on learning. That horizon of "if there's a bug in it, I can fix it" is severely limited. That's a price most free software developers don't consider worth paying. There's a lot of us.

replies(1): >>45687884 #
40. emp17344 ◴[] No.45687880{7}[source]
This exact statement could be said about literally any corporation or organization. And yet, corporations still lie and mislead, because deception helps you make money and acquire funding.

In OpenAI’s case, this isn’t exactly the first time they’ve been caught doing something ethically misguided:

https://techcrunch.com/2025/01/19/ai-benchmarking-organizati...

replies(1): >>45688978 #
41. simonw ◴[] No.45687884{11}[source]
I don't understand, what am I missing?

I love learning about software. That's why I'm leaning so heavily on LLMs these days - they let me learn so much faster, and let me dig into whole new areas that previously I would never have considered experimenting with.

Just this week LLMs helped me figure out how to run Perl inside WebAssembly in a browser... and then how to compile 25-year-old C code to run in WebAssembly in the browser too. https://simonwillison.net/2025/Oct/22/sloccount-in-webassemb...

If I'd done this without LLMs I might have learned more of the underlying details... but realistically I wouldn't have done this at all, because my interest in Perl and C in WebAssembly is not strong enough to justify investing more than a few hours of effort.

replies(1): >>45689125 #
42. Terr_ ◴[] No.45688727{3}[source]
Some nice charts here [0], which IMO show that LLMs are getting very good at guessing answers to certain arithmetic operations, but that they don't actually perform the arithmetic in a logical fashion.

[0] https://www.mindprison.cc/p/why-llms-dont-ask-for-calculator...

43. simonw ◴[] No.45688978{8}[source]
That story feels very different to me from straight up lying about whether a mathematical competition result used tools or not.
44. alganet ◴[] No.45689125{12}[source]
I would love to train an LLM from scratch to help me with some problems they're not good at, but I can't, because it costs thousands of dollars to do so. You probably can't either, or only in a very limited capacity (agents, or maybe LoRA).

A while back, I didn't even know those problems existed. It took me a while to understand them, why they're interesting, and why lots of people spend time on them.

I have tried to adapt the problems to the LLMs as well, such as shaping a problem to look more like something they're already trained on, but I soon realized the limitations of that approach.

I think in a couple of decades, maybe earlier, that kind of thing will be commonplace. People training their own stuff from scratch, on cheap hardware. It will unleash an even more rewarding learning experience for those willing to go the extra mile.

I think you're missing that perspective. That's fine, by the way. You're totally cool and probably helping lots of people with your work. I support it; it helps people better understand where LLMs can currently help and where they cannot.

replies(1): >>45689434 #
45. simonw ◴[] No.45689434{13}[source]
There aren't many tasks these days for which training or fine-tuning a model seems necessary to me.

One of the reasons I am so excited about the "skills" concept from Anthropic is that it helps emphasize how the latest generation of LLMs really can pick up new capabilities if you get them to read a single, carefully constructed markdown file.

replies(1): >>45689539 #
46. alganet ◴[] No.45689539{14}[source]
I'm trying to simplify the live-bootstrap project by removing dependencies, reducing build time, or making it more unattended (by automating the image creation steps, for example).

https://github.com/fosslinux/live-bootstrap/

Other efforts around the same problem are trying to make it more architecture independent or improve regenerations (re-building things like automake during the process).

It's free and open source, you're welcome to fork it and try your best with the aid of Claude. All you need is an x86 or x86-64 machine or qemu.

The project and other related repositories are already full of documentation in the markdown format and high quality commented code.

Here's a friendly primer on the problem:

https://www.youtube.com/watch?v=Fu3laL5VYdM

If you decide to help, please ask the maintainers if AI use is allowed beforehand. I'm OK with it, they might not be.

47. vrighter ◴[] No.45695425[source]
3 years in? How long had you been hibernating as of 3 years ago?
replies(1): >>45696580 #
48. ACCount37 ◴[] No.45696580{3}[source]
People in the industry started saying "oh shit this might be big" at a point between GPT-1 and GPT-2, but there were plenty of naysayers too. It only hit the mainstream with ChatGPT.

Which was also when the capabilities of LLMs became completely impossible to either ignore or excuse as "just matching seen data". But that was, in practice, solvable simply by increasing the copium intake.