Then you should be surprised that turbo-instruct actually plays well, right? We see a proliferation of hand-wavy arguments based on unfounded anthropomorphic intuitions about "actual reasoning" and whatnot. I think this is good evidence that nobody really understands what's going on.
If one mental model says that LLMs should be bad at chess, it fails to explain why turbo-instruct plays strong chess. If another predicts the opposite, it fails to explain why so many large models fail spectacularly at the game.
Clearly, there's more going on here.