688 points crescit_eundo | 1 comment | source
1. golol No.42145448
My understanding is the following: all the bad models are chat models, "generation 2" LLMs that are not plain text-completion models but are instead trained to behave as chat agents. The only good model here is the only "generation 1" LLM, gpt-3.5-turbo-instruct, which is a straightforward text-completion model. If you prompt it to "get in the mind" of PGN completion, it can use a kind of system-1 thinking to give a decent approximation of the PGN Markov process. A chat model doesn't work because those stochastic pathways somehow degenerate during the training that turns it into a chat agent. You can, however, play chess with system-2 thinking, and the more advanced chat models are trying to do exactly that; they should keep getting better at it while still being bad.
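A rough sketch of what "prompting it to complete PGN" could look like. The helper names `pgn_prompt` and `extract_move` are hypothetical; in practice you would send the prompt to a text-completions endpoint (e.g. gpt-3.5-turbo-instruct) and parse the raw continuation. This is just the prompt-building and parsing side, under those assumptions:

```python
import re

def pgn_prompt(moves):
    """Build a text-completion prompt: a PGN move list ending mid-game,
    so a completion model naturally continues with the next move."""
    # e.g. ["e4", "e5", "Nf3"] -> "1. e4 e5 2. Nf3"
    parts = []
    for i, move in enumerate(moves):
        if i % 2 == 0:                      # White's move starts a numbered pair
            parts.append(f"{i // 2 + 1}.")
        parts.append(move)
    return " ".join(parts)

# Loose SAN pattern: piece moves, pawn moves, captures, promotion, castling.
SAN = re.compile(r"\b([KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?[+#]?|O-O(?:-O)?)")

def extract_move(completion):
    """Pull the first SAN-looking token out of the model's raw completion
    (which may also contain move numbers and further moves)."""
    m = SAN.search(completion)
    return m.group(1) if m else None
```

Nothing here checks legality; the point of the comment is that a generation-1 completion model, given such a prompt, tends to continue the PGN text plausibly on its own.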