
688 points crescit_eundo | 7 comments
1. fabiospampinato ◴[] No.42145891[source]
It's probably worth playing around with different prompts and different board positions.

For context, this [1] is the board position the model is being prompted on.

There may be more than one weird thing about this experiment; for example, giving instructions to the non-instruction-tuned variants may be counterproductive.

More importantly, suppose you just give the model the truncated PGN: does this look like a position where white is a grandmaster-level player? I don't think so. Even if the model understood chess really well, it's going to try to predict the most probable move given the position at hand. If the model thinks that white is a bad player, and the model is good at understanding chess, it will predict bad moves as the more likely ones, because that better predicts what is most likely to happen next.

[1]: https://i.imgur.com/qRxalgH.png
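
A minimal sketch of what "just give the model the truncated PGN" might look like, assuming the OpenAI Python SDK and gpt-3.5-turbo-instruct; the player names in the headers and the exact prompt framing are made up for illustration, and are exactly the kind of variable worth tweaking:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # A truncated PGN: headers claiming strong players, then the moves so far.
    pgn_so_far = (
        '[White "Garry Kasparov"]\n'
        '[Black "Magnus Carlsen"]\n\n'
        "1. e4 e5 2. Nf3 "
    )

    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # a completion (non-chat) model
        prompt=pgn_so_far,
        max_tokens=6,      # roughly one move in SAN
        temperature=0.0,
    )
    print(resp.choices[0].text)  # the model's continuation, e.g. "Nc6"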

replies(4): >>42146161 #>>42147006 #>>42147866 #>>42150105 #
2. Closi ◴[] No.42146161[source]
Agree with this. A few prompt variants:

* What if you allow the model to do Chain of Thought (explicitly disallowed in this experiment)?

* What if you explain the board position at each step to the model in the prompt, so it doesn't have to calculate/estimate it internally? (See the sketch below.)
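
A rough sketch of that second variant, using python-chess to spell the position out in the prompt; the wording of the prompt itself is just an assumption:

    import chess

    board = chess.Board()
    for san in ["e4", "e5", "Nf3", "Nc6", "Bb5"]:
        board.push_san(san)

    side = "White" if board.turn == chess.WHITE else "Black"
    prompt = (
        f"{side} to move.\n"
        f"{board}\n"            # ASCII diagram of the current position
        f"FEN: {board.fen()}\n"
        "Moves so far: 1. e4 e5 2. Nf3 Nc6 3. Bb5\n"
        "Reply with the best move in SAN."
    )
    print(prompt)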

replies(1): >>42149903 #
3. fabiospampinato ◴[] No.42147006[source]
Apparently I can find some matches for games that start like that between very strong players [1], so my hypothesis that the model may just be predicting bad moves on purpose seems wobbly. Still, having Stockfish at the lowest level play as the supposedly very strong opponent may be throwing the model off somewhat. If I'm interpreting the charts right, the first few moves the model makes seem decent, and only after a few of those do things start to go wrong.

Either way it's worth repeating the experiment imo, tweaking some of these variables (prompt guidance, Stockfish strength, starting position, the names of the supposed players, etc.).

[1]: https://www.365chess.com/search_result.php?search=1&p=1&m=8&...
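
For the Stockfish-strength variable specifically, python-chess exposes the engine's "Skill Level" option (0-20); a small sketch, assuming a stockfish binary on PATH and an arbitrary time limit:

    import chess
    import chess.engine

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"Skill Level": 0})  # 0 = weakest; raise it to test stronger opposition

    board = chess.Board()
    result = engine.play(board, chess.engine.Limit(time=0.1))
    print(board.san(result.move))
    engine.quit()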

replies(1): >>42164340 #
4. spott ◴[] No.42147866[source]
He was playing full games, not single moves.
5. int_19h ◴[] No.42149903[source]
They also tested GPT-o1, which always uses CoT, yet it still did worse.
6. NiloCK ◴[] No.42150105[source]
The experiment started from the first move of a game, and played each game fully. The position you linked was just an example of the format used to feed the game state to the model for each move.

What would "winning" or "losing" even mean if all of this was against a single move?
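
To make the distinction concrete, the full-game setup looks roughly like this; get_model_move is a hypothetical helper wrapping a completion call like the one sketched above, and the Stockfish settings are assumptions:

    import chess
    import chess.engine

    def get_model_move(board: chess.Board, moves_so_far: str) -> chess.Move:
        # Hypothetical: prompt the LLM with the game so far and parse its reply as SAN.
        raise NotImplementedError

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"Skill Level": 0})

    board = chess.Board()
    moves_so_far = ""
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            move = get_model_move(board, moves_so_far)  # the model plays white
        else:
            move = engine.play(board, chess.engine.Limit(time=0.1)).move
        moves_so_far += board.san(move) + " "           # SAN must be computed before push
        board.push(move)

    engine.quit()
    print(board.result())  # "1-0", "0-1" or "1/2-1/2" -- only meaningful for a whole game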

7. sjducb ◴[] No.42164340[source]
Interesting thought: the LLM isn't trying to win, it's trying to produce data that looks like the input data. It's quite rare for a very strong player to play a very weak one. If you feed it lots of weak moves, it'll best replicate the training data by following with weak moves.