Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution, at the cost of a (much) larger state space. So if a brain or a computer can implement it, there is a sufficiently large linear probabilistic dynamical system that can model it. More really is different.
So I view all deductive ab initio arguments about what LLMs can or can't do, based on their architecture alone, as fairly baseless.
(Note that the "large" here is doing a lot of heavy lifting. You need _really_ large. See https://en.m.wikipedia.org/wiki/Transfer_operator)
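To make that concrete, here's a minimal Python sketch of the standard Ulam discretization of a transfer operator: the (nonlinear) logistic map gets replaced by a linear Markov matrix acting on a binned probability density, and accuracy is bought entirely with bin count. The parameter values are illustrative, nothing more.

```python
# Ulam's method: approximate the nonlinear logistic map x -> 4x(1-x)
# by a *linear* Markov operator acting on a discretized density.
# The bin count N is the "larger state space" knob.
import numpy as np

N = 500                                  # bins on [0, 1]
edges = np.linspace(0.0, 1.0, N + 1)

# Build the transfer matrix by Monte Carlo: sample points in each bin,
# push them through the map, count where they land.
rng = np.random.default_rng(0)
samples_per_bin = 1000
P = np.zeros((N, N))
for i in range(N):
    x = rng.uniform(edges[i], edges[i + 1], samples_per_bin)
    y = 4.0 * x * (1.0 - x)              # the nonlinear map
    idx = np.clip(np.searchsorted(edges, y) - 1, 0, N - 1)
    np.add.at(P[i], idx, 1.0 / samples_per_bin)

# Evolving a density is now a single matrix-vector product: pure linear dynamics.
rho = np.ones(N) / N                     # uniform initial density
for _ in range(100):
    rho = rho @ P

# The logistic map's invariant density is 1/(pi*sqrt(x(1-x))); the linear
# evolution should land near it, up to bin width and sampling noise.
centers = 0.5 * (edges[:-1] + edges[1:])
exact = 1.0 / (np.pi * np.sqrt(centers * (1.0 - centers)))
print(np.abs(rho * N - exact).mean())    # should be a small average error
```

The point is the middle loop: after the one-time discretization, every step of the nonlinear dynamics is a matrix multiply over probabilities, and the only price is N.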
If you think there is a threshold at which a large enough feedforward network develops the capability to backtrack, I'd like to see your argument for it.
The fundamental autoregressive architecture is absolutely capable of backtracking: we generate next-token probabilities, select a token, then compute probabilities for the token after that.
There is nothing stopping you from "rewinding" to an earlier token, making a different selection, and replaying from that point. The basic architecture supports it.
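For illustration, a minimal runnable sketch of that rewind-and-reselect loop. Everything here is hypothetical scaffolding: `next_token_logprobs` is a toy stand-in for an LLM forward pass, and `violates_constraint` stands in for whatever would trigger a rewind (a verifier, a failed check, an implausible continuation).

```python
# Backtracking decode as depth-first search over token choices.
import math

VOCAB = ["a", "b", "c", "<eos>"]

def next_token_logprobs(prefix):
    # Toy model: fixed preference a > b > c > <eos>, normalized to log-probs.
    scores = [0.0, -1.0, -2.0, -5.0]
    z = math.log(sum(math.exp(s) for s in scores))
    return dict(zip(VOCAB, [s - z for s in scores]))

def violates_constraint(prefix):
    # Hypothetical rewind trigger: forbid immediate token repeats.
    return len(prefix) >= 2 and prefix[-1] == prefix[-2]

def backtracking_decode(max_len=8):
    # At each position try tokens in order of model probability; on a
    # violation, rewind to the most recent position that still has
    # untried alternatives and take the next-best one.
    prefix, alternatives = [], []   # alternatives[i] = untried tokens at position i
    while True:
        if violates_constraint(prefix):
            while alternatives and not alternatives[-1]:
                prefix.pop()
                alternatives.pop()
            if not alternatives:
                return None                        # search space exhausted
            prefix[-1] = alternatives[-1].pop(0)   # reselect, then replay
            continue
        if len(prefix) == max_len or (prefix and prefix[-1] == "<eos>"):
            return prefix
        lp = next_token_logprobs(prefix)
        ranked = sorted(lp, key=lp.get, reverse=True)
        prefix.append(ranked[0])                   # greedy pick...
        alternatives.append(ranked[1:])            # ...remember the rest

print(backtracking_decode())   # -> ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
```

Notice it needs nothing from the model beyond ordinary next-token probabilities; the backtracking lives entirely in the decoding loop.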
Why, then, has nobody implemented it? Maybe this kind of backtracking isn't really that useful.