Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space. So if you can implement it in a brain or a computer, there is a sufficiently large probabilistic dynamic that can model it. More really is different.
So I view all deductive ab-initio arguments about what LLMs can/can't do due to their architecture as fairly baseless.
(Note that the "large" here is doing a lot of heavy lifting. You need _really_ large. See https://en.m.wikipedia.org/wiki/Transfer_operator)
If you think there is a threshold at which point some large enough feedforward network develops the capability to backtrack then I'd like to see your argument for it.
Take a finite tape Turing machine with N states and tape length T and N^T total possible tape states.
Now consider that you have a probability for each state instead of a definite state. The transitions of the Turing machine induce transitions of the probabilities. These transitions define a Markov chain on a N^T dimensional probability space.
Is this useful? Absolutely not. It's just a trivial rewriting. But it shows that high dimensional spaces are extremely powerful. You can trade off sophisticated transition rules for high dimensionality.
You _can_ continue this line of thought though in more productive directions. E.g. what if the input of your machine is genuinely uncertain? What if the transitions are not precise but slightly noisy? You'd expect that the fundamental capabilities of a noisy machine wouldn't be that much worse than those of a noiseless ones (over finite time horizons). What if the machine was built to be noise resistant in some way?
All of this should regularize the Markov chain above. If it's more regular you can start thinking about approximating it using a lower rank transition matrix.
The point of this is not to say that this is really useful. It's to say that there is no reason in my mind to dismiss the purely mathematical rewriting as entirely meaningless in practice.