
170 points PaulHoule | 5 comments
measurablefunc ◴[] No.45120049[source]
There is a formal extensional equivalence between Markov chains & LLMs, but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding cannot be reduced to a probabilistic computation: regardless of how large the graph gets, it will still be missing basic capabilities like backtracking (which is available in programming languages like Prolog). I think Gary is right on basically all counts. Probabilistic generative models are fun, but no amount of probabilistic sequence generation can be a substitute for logical reasoning.
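
Concretely, here is a toy sketch of what that equivalence looks like (my own illustration, with made-up names, not Marcus's formalism): a fixed-context LLM is extensionally a Markov chain whose state is the entire context window, so it is memoryless only over a vocab**k state space.

    import itertools, random

    # Toy illustration: any fixed-context LLM defines a Markov chain whose
    # state is the whole context window, giving vocab**k possible states.
    vocab = ["a", "b"]
    k = 3  # context window length

    def llm(context):
        # Stand-in for a real model: any map from a length-k context to a
        # next-token distribution works; this one is arbitrary but fixed.
        rng = random.Random(hash(context))
        p = rng.random()
        return {"a": p, "b": 1.0 - p}

    def step(state):
        # Transition kernel of the equivalent chain: sample a token, then
        # shift the window. Memoryless in `state`, not in single tokens.
        dist = llm(state)
        tok = random.choices(list(dist), weights=list(dist.values()))[0]
        return state[1:] + (tok,)

    states = list(itertools.product(vocab, repeat=k))
    print(len(states))  # 2**3 here; vocab**k is astronomical for real LLMs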
replies(16): >>45120249 #>>45120259 #>>45120415 #>>45120573 #>>45120628 #>>45121159 #>>45121215 #>>45122702 #>>45122805 #>>45123808 #>>45123989 #>>45125478 #>>45125935 #>>45129038 #>>45130942 #>>45131644 #
Certhas ◴[] No.45120259[source]
I don't understand what point you're hinting at.

Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space. So if you can implement it in a brain or a computer, there is a sufficiently large probabilistic dynamic that can model it. More really is different.

So I view all deductive ab initio arguments about what LLMs can/can't do due to their architecture as fairly baseless.

(Note that the "large" here is doing a lot of heavy lifting. You need _really_ large. See https://en.m.wikipedia.org/wiki/Transfer_operator)
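
A minimal sketch of that lifting trick (Ulam's method; my own example, with an arbitrary grid size): the nonlinear logistic map becomes a linear Markov matrix over N bins, and the approximation only becomes good as N grows, which is where the "really large" comes in.

    import numpy as np

    # Ulam's method: approximate a nonlinear map by a *linear* Markov
    # transition matrix on a discretized state space. Finer grids (larger
    # N) approximate the transfer operator better; hence "really large".
    N = 200
    edges = np.linspace(0.0, 1.0, N + 1)
    f = lambda x: 4.0 * x * (1.0 - x)  # chaotic logistic map

    P = np.zeros((N, N))
    samples = 100
    for i in range(N):
        xs = np.random.uniform(edges[i], edges[i + 1], samples)
        js = np.clip(np.digitize(f(xs), edges) - 1, 0, N - 1)
        for j in js:
            P[i, j] += 1.0 / samples  # estimated transition probability

    rho = np.ones(N) / N  # uniform initial density
    for _ in range(50):
        rho = rho @ P  # purely linear evolution of the density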

replies(5): >>45120313 #>>45120341 #>>45120344 #>>45123837 #>>45124441 #
measurablefunc ◴[] No.45120344[source]
What part about backtracking is baseless? Typical Prolog interpreters can be implemented in a few MB of binary code (the high-level specification is even simpler and fits in a few hundred KB)¹, but none of the LLMs (open source or not) are capable of backtracking, even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming that no amount of smooth approximation can overcome.

If you think there is a threshold at which some large enough feedforward network develops the capability to backtrack, then I'd like to see your argument for it.

¹https://en.wikipedia.org/wiki/Warren_Abstract_Machine
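
For reference, the mechanism in question is easy to state; here is a toy sketch of Prolog-style chronological backtracking (a hypothetical mini-interpreter over propositional Horn clauses, nothing like a real WAM):

    # Each clause alternative is a choice point; when one branch fails,
    # control falls through to the next alternative (backtracking).
    def solve(goals, rules):
        if not goals:
            yield True  # all goals proven
            return
        head, *rest = goals
        for body in rules.get(head, []):  # choice point: try each clause
            yield from solve(body + rest, rules)

    # head -> list of alternative bodies; "edge_ac" has no clause, so the
    # first alternative for "path" fails and the solver backtracks.
    rules = {
        "path": [["edge_ac"], ["edge_ab", "edge_bc"]],
        "edge_ab": [[]],
        "edge_bc": [[]],
    }
    print(any(solve(["path"], rules)))  # True, found via backtracking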

replies(3): >>45120516 #>>45121626 #>>45124764 #
skissane ◴[] No.45121626[source]
> but none of the LLMs (open source or not) are capable of backtracking, even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming that no amount of smooth approximation can overcome.

The fundamental autoregressive architecture is absolutely capable of backtracking: we generate next-token probabilities, select a next token, then calculate probabilities for the token thereafter.

There is absolutely nothing stopping you from “rewinding” to an earlier token, making a different selection and replaying from that point. The basic architecture absolutely supports it.
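
A minimal sketch of that rewind-and-resample loop (my own toy wrapper; `next_token_dist` and `reject` are hypothetical stand-ins for a model forward pass and an external check):

    import random

    def generate_with_backtracking(next_token_dist, prompt, max_len,
                                   reject, rewind=5, max_retries=10):
        seq = list(prompt)
        retries = 0
        while len(seq) < max_len:
            dist = next_token_dist(tuple(seq))  # {token: probability}
            tokens, probs = zip(*dist.items())
            seq.append(random.choices(tokens, weights=probs)[0])
            if reject(seq) and retries < max_retries:
                # Rewind to an earlier token and replay from that point.
                seq = seq[:max(len(prompt), len(seq) - rewind)]
                retries += 1
        return seq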

Why then has nobody implemented it? Maybe this kind of backtracking isn't really that useful.

replies(2): >>45121703 #>>45124591 #
1. measurablefunc ◴[] No.45121703[source]
Where is this spelled out formally and proven logically?
replies(1): >>45121936 #
2. skissane ◴[] No.45121936[source]
LLM backtracking is an active area of research, see e.g.

https://arxiv.org/html/2502.04404v1

https://arxiv.org/abs/2306.05426

And I was wrong that nobody has implemented it; as these papers show, people have. It is just that the results haven't been sufficiently impressive to support the transition from the research lab to industrial use, or at least not yet.

replies(2): >>45122006 #>>45122999 #
3. measurablefunc ◴[] No.45122006[source]
> Empirical evaluations demonstrate that our proposal significantly enhances the reasoning capabilities of LLMs, achieving a performance gain of over 40% compared to the optimal-path supervised fine-tuning method.
4. afiori ◴[] No.45122999[source]
I would expect to see something like this soonish, as right around now we are seeing the end of training scaling and the beginning of inference scaling.
replies(1): >>45123196 #
5. foota ◴[] No.45123196{3}[source]
This is a neat observation: training has been optimized to hell, and inference is just beginning.