Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space. So if you can implement it in a brain or a computer, there is a sufficiently large probabilistic dynamic that can model it. More really is different.
So I view all deductive ab-initio arguments about what LLMs can/can't do due to their architecture as fairly baseless.
(Note that the "large" here is doing a lot of heavy lifting. You need _really_ large. See https://en.m.wikipedia.org/wiki/Transfer_operator)
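For a concrete feel of what "linear probabilistic evolution" means here, below is a rough sketch of Ulam's method, a standard way to discretize the transfer operator, applied to the nonlinear logistic map. The choice of map, the bin and sample counts, and the helper names `ulam_matrix` and `step` are illustrative assumptions, not anything claimed in the comment above.

```python
def logistic(x):
    """A nonlinear map on [0, 1], used only as an example."""
    return 4.0 * x * (1.0 - x)


def ulam_matrix(f, n_bins, samples_per_bin=200):
    """Row-stochastic matrix: P[i][j] is roughly the fraction of bin i
    that f sends into bin j (hypothetical helper, Ulam-style estimate)."""
    P = [[0.0] * n_bins for _ in range(n_bins)]
    for i in range(n_bins):
        for k in range(samples_per_bin):
            # Evenly spaced sample points inside bin i.
            x = (i + (k + 0.5) / samples_per_bin) / n_bins
            j = min(int(f(x) * n_bins), n_bins - 1)
            P[i][j] += 1.0 / samples_per_bin
    return P


def step(dist, P):
    """One linear update of a probability vector: dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


n = 50
P = ulam_matrix(logistic, n)
dist = [1.0 / n] * n          # start from the uniform distribution
for _ in range(50):
    dist = step(dist, P)      # nonlinear dynamics, linear evolution of densities
print(sum(dist))              # total probability mass after 50 steps
```

Refining the partition (larger `n_bins`) tightens the approximation, which is where the state-space blow-up comes from.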
If you think there is a threshold at which some large enough feedforward network develops the capability to backtrack, then I'd like to see your argument for it.
Have each of the Markov chain's states be one of 10^81 possible sudoku grids (a 9x9 grid of digits 1-9 and blank), then calculate the 10^81-by-10^81 transition matrix that takes each incomplete grid to the valid complete grid containing the same numbers. If you want you could even have it fill one square at a time rather than jump right to the solution, though there's no need to.
Up to you what you do for ambiguous inputs (select one solution at random to give 1.0 probability in the transition matrix? weight the valid solutions equally? have the states be sets of boards and map to the set of all valid solutions?) and impossible inputs (map the grid to itself? have the states be sets of boards and map to the empty set?).
Could say that's "cheating" by pre-computing the answers and hard-coding them in a massive input-output lookup table, but to my understanding that's also the only sense in which there's equivalence between Markov chains and LLMs.
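The lookup-table construction above can be sketched on a toy stand-in small enough to enumerate: a 2x2 Latin square with digits 1-2 and blank (3^4 = 81 states) instead of a 9x9 sudoku. The toy puzzle, the sparse dict-of-dicts "matrix", and the choices made here (equal weights for ambiguous inputs, self-loops for impossible ones) are assumptions picked from the options listed above, not the only way to do it.

```python
from itertools import product

BLANK = 0  # cell values: 0 = blank, 1 or 2 = digit


def is_solved(grid):
    """True if every row and column of the flattened 2x2 grid holds {1, 2}."""
    rows = [grid[0:2], grid[2:4]]
    cols = [grid[0::2], grid[1::2]]
    return all(sorted(line) == [1, 2] for line in rows + cols)


# Pre-compute every valid complete grid (this is the "cheating" part).
SOLVED = [g for g in product((1, 2), repeat=4) if is_solved(g)]


def completions(grid):
    """All solved grids that agree with the given grid on its filled cells."""
    return [s for s in SOLVED
            if all(c == BLANK or c == v for c, v in zip(grid, s))]


def transition_row(grid):
    """One sparse row of the transition matrix, as {next_state: probability}."""
    sols = completions(grid)
    if not sols:                      # impossible input: map to itself
        return {grid: 1.0}
    return {s: 1.0 / len(sols) for s in sols}  # ambiguous: equal weights


# The full 81-row transition "matrix", stored sparsely.
MATRIX = {g: transition_row(g) for g in product((0, 1, 2), repeat=4)}

print(MATRIX[(1, 0, 0, 0)])  # row for a grid with one clue in the top-left
```

The same recipe for the real puzzle is just this dict with 10^81 keys, which is the sense in which it is a hard-coded input-output table.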
Edit: I see you added questions about the ambiguities, but modulo those choices your solution only almost works, because it is not entirely extensionally equivalent. The transition graph and the solver are almost extensionally equivalent, but whereas the Prolog solver will backtrack, there is no backtracking in the Markov chain, and you have to re-run the chain multiple times to find all the solutions.
If you want it to give all possible solutions at once, you can just expand the state space to the power-set of sudoku boards, such that the input board transitions to the state representing the set of valid solved boards.
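A minimal sketch of that power-set variant, again on a toy 2x2 Latin square (digits 1-2, 0 for blank) rather than real sudoku: each state is a `frozenset` of boards, and the transition is deterministic. The function name `lift` and the toy puzzle are assumptions for illustration only.

```python
from itertools import product


def is_solved(grid):
    """True if every row and column of the flattened 2x2 grid holds {1, 2}."""
    lines = [grid[0:2], grid[2:4], grid[0::2], grid[1::2]]
    return all(sorted(line) == [1, 2] for line in lines)


SOLVED = [g for g in product((1, 2), repeat=4) if is_solved(g)]


def lift(state):
    """Transition on sets of boards: a set maps to the set of all valid
    solved boards consistent with any member (empty set if none exist)."""
    return frozenset(
        s for board in state for s in SOLVED
        if all(c == 0 or c == v for c, v in zip(board, s))
    )


all_solutions = lift(frozenset({(0, 0, 0, 0)}))   # blank board: every solution
no_solutions = lift(frozenset({(1, 1, 0, 0)}))    # contradictory clues
print(sorted(all_solutions), no_solutions)
```

One step of this chain returns every solution at once, at the price of a state space of size 2 to the power of the number of boards.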