
688 points | crescit_eundo | 1 comment
underlines [No.42142255]
Can you try increasing compute in the problem search space rather than in the training space? What I mean is: give the model more compute to think during inference by not forcing it to "only output the answer in algebraic notation", and instead doing CoT prompting along the lines of:

1. Think about the current board.
2. Think about valid possible next moves and choose the 3 best by thinking ahead.
3. Make your move.

Or whatever you deem a good step-by-step set of instructions for what a good beginner chess player would actually do.
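
For example, a minimal sketch of that kind of CoT prompt using the openai Python client (the model name, temperature, and the "MOVE:" extraction convention are my own placeholders, not anything from the article):

    from openai import OpenAI

    client = OpenAI()

    COT_TEMPLATE = """You are playing chess. The game so far (PGN): {pgn}

    1. Think about the current board position.
    2. Think about valid possible next moves and pick the 3 best by looking ahead.
    3. State your chosen move on the last line as: MOVE: <move in algebraic notation>"""

    def cot_move(pgn, model="gpt-4o-mini", temperature=0.7):
        # Ask for one CoT-prompted move and pull it out of the last "MOVE:" line.
        resp = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": COT_TEMPLATE.format(pgn=pgn)}],
        )
        text = resp.choices[0].message.content
        for line in reversed(text.splitlines()):
            if line.strip().upper().startswith("MOVE:"):
                return line.split(":", 1)[1].strip()
        return text.strip()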

Then try different notations, prompt variations, temperatures, and other parameters; all of that belongs in your hyperparameter tuning.
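
A rough sweep over those knobs, where score_game() is a placeholder for whatever strength metric you pick (legal-move rate, centipawn loss against an engine, ...):

    import itertools

    # Hypothetical prompt variants and temperatures to sweep over.
    PROMPTS = {
        "answer_only": "Game so far (PGN): {pgn}\nReply with only the next move.",
        "cot": "Game so far (PGN): {pgn}\nThink step by step, then finish with MOVE: <move>.",
    }
    TEMPERATURES = [0.0, 0.3, 0.7, 1.0]

    def sweep(test_games, score_game):
        # score_game(template, temperature, game) -> float is supplied by you.
        results = {}
        for (name, template), temp in itertools.product(PROMPTS.items(), TEMPERATURES):
            scores = [score_game(template, temp, game) for game in test_games]
            results[(name, temp)] = sum(scores) / len(scores)
        return results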

One could try using DSPy for automatic prompt optimization.
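
A rough DSPy sketch; the API has moved around between DSPy versions, so treat the names as approximate, and the metric and trainset below are toy placeholders:

    import dspy

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # example model string

    class NextMove(dspy.Signature):
        """Given a chess game in PGN, choose the next move."""
        pgn = dspy.InputField()
        move = dspy.OutputField(desc="next move in standard algebraic notation")

    program = dspy.ChainOfThought(NextMove)

    def exact_move_metric(example, prediction, trace=None):
        # Toy metric: exact match against a reference move; in practice you
        # would at least check move legality with a chess library.
        return float(prediction.move.strip() == example.move.strip())

    # Tiny illustrative trainset; real tuning would use many labelled positions.
    trainset = [dspy.Example(pgn="1. e4 e5 2. Nf3", move="Nc6").with_inputs("pgn")]

    optimizer = dspy.BootstrapFewShot(metric=exact_move_metric)
    compiled = optimizer.compile(program, trainset=trainset)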

replies(2): >>42142533, >>42143035
1. viraptor [No.42142533]
Yeah, expecting the model to answer immediately definitely affects the results, especially in the later stages. Another possible improvement: every 2 steps, show the current board state and repeat the moves still to be processed, before analysing the final position.
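
A sketch of building that interleaved prompt with the python-chess package (using FEN for the board state is my assumption, not something from the thread):

    import chess  # python-chess

    def interleaved_prompt(moves_san, every=2):
        # Replay the moves; every `every` moves, insert the current board (FEN)
        # and the moves still to be processed, then ask for the final analysis.
        board = chess.Board()
        lines = []
        for i, san in enumerate(moves_san, start=1):
            board.push_san(san)
            lines.append(f"Move {i}: {san}")
            if i % every == 0 and i < len(moves_san):
                lines.append(f"Board after move {i} (FEN): {board.fen()}")
                lines.append("Moves still to process: " + " ".join(moves_san[i:]))
        lines.append("Now analyse the final position and give the best next move.")
        return "\n".join(lines)

    # Example: print(interleaved_prompt(["e4", "e5", "Nf3", "Nc6", "Bb5"]))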