My money is on a fluke inclusion of more chess data in that model's training.
All the other models do roughly equally well on other tasks and are in many cases architecturally similar, so training data is the most likely explanation.