The wall confronting large language models

(arxiv.org)

170 points PaulHoule | 5 comments | 03 Sep 25 11:40 UTC | HN request time: 1.115s | source

Show context

measurablefunc ◴[03 Sep 25 20:29 UTC] No.45120049[source]▶

There is a formal extensional equivalence between Markov chains & LLMs but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding can not be reduced to a probabilistic computation regardless of how large the graph gets it will still be missing basic stuff like backtracking (which is available in programming languages like Prolog). I think that Gary is right on basically all counts. Probabilistic generative models are fun but no amount of probabilistic sequence generation can be a substitute for logical reasoning.

replies(16): >>45120249 #>>45120259 #>>45120415 #>>45120573 #>>45120628 #>>45121159 #>>45121215 #>>45122702 #>>45122805 #>>45123808 #>>45123989 #>>45125478 #>>45125935 #>>45129038 #>>45130942 #>>45131644 #

jules ◴[03 Sep 25 21:24 UTC] No.45120573[source]▶

>>45120049 #

What does this predict about LLMs ability to win gold at the International Mathematical Olympiad?

replies(2): >>45120671 #>>45122931 #

godelski ◴[04 Sep 25 02:44 UTC] No.45122931[source]▶

>>45120573 #

Depends which question you're asking.

Ability to win a gold medal as if they were scored similarly to how humans are scored?

Ability to win a gold medal as determined by getting the "correct answer" to all the questions?

These are subtly two very different questions. In these kinds of math exams how you get to the answer matters more than the answer itself. i.e. You could not get high marks through divination. To add some clarity, the latter would be like testing someone's ability to code by only looking at their results to some test functions (oh wait... that's how we evaluate LLMs...). It's a good signal but it is far from a complete answer. It very much matters how the code generates the answer. Certainly you wouldn't accept code if it does a bunch of random computations before divining an answer.

The paper's answer to your question (assuming scored similarly to humans) is "Don’t count on it". Not a definitive "no" but they strongly suspect not.

replies(1): >>45123108 #

jules ◴[04 Sep 25 03:10 UTC] No.45123108[source]▶

>>45122931 #

The type of reasoning by the OP and the linked paper obviously does not work. The observable reality is that LLMs can do mathematical reasoning. A cursory interaction with state of the art LLMs makes this evident, as does their IMO gold medal scored like humans are. You cannot counter observable reality with generic theoretical considerations about Markov chains or pretraining scaling laws or floating point precision. The irony is that LLMs can explain why that type of reasoning is faulty:

> Any discrete-time computation (including backtracking search) becomes Markov if you define the state as the full machine configuration. Thus “Markov ⇒ no reasoning/backtracking” is a non sequitur. Moreover, LLMs can simulate backtracking in their reasoning chains. -- GPT-5

replies(2): >>45125259 #>>45125354 #

godelski ◴[04 Sep 25 09:35 UTC] No.45125354[source]▶

>>45123108 #

  > The observable reality is that LLMs can do mathematical reasoning

I still can't get these machines to reliably perform basic subtraction[0]. The result is stochastic, so I can get the right answer, but have yet to reproduce one where the actual logic is correct[1,2]. Both [1,2] perform the same mistake and in [2] you see it just say "fuck it, skip to the answer"

  > You cannot counter observable reality

I'd call [0,1,2] "observable". These types of errors are quite common, so maybe I'm not the one with lying eyes.

[0] https://chatgpt.com/share/68b95bf5-562c-8013-8535-b61a80bada...

[1] https://chatgpt.com/share/68b95c95-808c-8013-b4ae-87a3a5a42b...

[2] https://chatgpt.com/share/68b95cae-0414-8013-aaf0-11acd0edeb...

replies(1): >>45125387 #

FergusArgyll ◴[04 Sep 25 09:40 UTC] No.45125387[source]▶

>>45125354 #

Why don't you use a state of the art model? Are you scared it will get it right? Or are you just not aware of reasoning models in which case you should get to know the field

replies(2): >>45125428 #>>45127681 #

1. godelski ◴[04 Sep 25 09:47 UTC] No.45125428[source]▶

>>45125387 #

Careful there, without a /s people might think you're being serious.

replies(1): >>45125580 #

2. FergusArgyll ◴[04 Sep 25 10:15 UTC] No.45125580[source]▶

>>45125428 (TP) #

I am being serious, why don't you use a SOTA model?

replies(1): >>45129312 #

3. godelski ◴[04 Sep 25 16:50 UTC] No.45129312[source]▶

>>45125580 #

Sorry, I've just been hearing this response for years now... GPT-5 not SOTA enough for you all now? I remember when people told me to just use 3.5

  - Gemini 2.5 Pro[0], the top model on LLM Arena. This SOTA enough for you? It even hallucinated Python code!

  - Claude Opus 4.1, sharing that chat shares my name, so here's a screenshot[1]. I'll leave that one for you to check. 

  - Grok4 getting the right answer but using bad logic[2]

  - Kimi K2[3]

  - Mistral[4]

I'm sorry, but you can fuck off with your goal post moving. They all do it. Check yourself.

  > I am being serious

Don't lie to yourself, you never were

People like you have been using that copy-paste piss-poor logic since the GPT-3 days. The same exact error existed since those days on all those models just as it does today. You all were highly disingenuous then, and still are now. I know this comment isn't going to change your mind because you never cared about the evidence. You could have checked yourself! So you and your paperclip cult can just fuck off

[0] https://g.co/gemini/share/259b33fb64cc

[1] https://0x0.st/KXWf.png

[2] https://grok.com/s/c2hhcmQtNA%3D%3D_e15bb008-d252-4b4d-8233-...

[3] http://0x0.st/KXWv.png

[4] https://chat.mistral.ai/chat/8e94be15-61f4-4f74-be26-3a4289d...

replies(1): >>45130498 #

4. FergusArgyll ◴[04 Sep 25 18:23 UTC] No.45130498{3}[source]▶

>>45129312 #

That's very weird, before I wrote my comment I asked gpt5-thinking (yes, once) and it nailed it. I just assumed the rest would get it as well, gemini-2.5 is shocking (the code!) I hereby give you leave to be a curmudgeon for another year...

replies(1): >>45131235 #

5. godelski ◴[04 Sep 25 19:25 UTC] No.45131235{4}[source]▶

>>45130498 #

Try a few times and it'll happen. I don't think it took me more than 3 tries on any platform.

To convince me it is "reasoning", it needs to get the answer right consistently. Most attempts were actually about getting it to show its results. But pay close attention. GPT got the answer right several times but through incorrect calculations. Go check the "thinking" and see if it does a 11-9=2 calculation somewhere, I saw this >50% of the attempts. You should be able to reproduce my results in <5 minutes.

Forgive my annoyance, but we've been hearing the same argument you've made for years[0,1,2,3,4]. We're talking about models that have been reported as operating at "PhD Level" since the previous generation. People have constantly been saying "But I get the right answer" or "if you use X model it'll get it right" while missing the entire point. It never mattered if it got the answer right once, it matters that it can do it consistently. It matters how it gets the answer if you want to claim reasoning. There is still no evidence that LLMs can perform even simple math consistently, despite years of such claims[5]

[0] https://news.ycombinator.com/item?id=34113657

[1] https://news.ycombinator.com/item?id=36288834

[2] https://news.ycombinator.com/item?id=36089362

[3] https://news.ycombinator.com/item?id=37825219

[4] https://news.ycombinator.com/item?id=37825059

[5] Don't let your eyes trick you, not all those green squares are 100%... You'll also see many "look X model got it right!" in response to something tested multiple times... https://x.com/yuntiandeng/status/1889704768135905332

↑