Something weird is happening with LLMs and chess

(dynomight.substack.com)


fsndz ◴[15 Nov 24 00:46 UTC] No.42142922[source]▶

Wow, I actually did something similar recently: no LLM could win, and the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it. https://www.lycee.ai/blog/what-happens-when-llms-play-chess

I am very surprised by the performance of gpt-3.5-turbo-instruct. Beating Stockfish? I will have to run the experiment with that model to check that out.

replies(1): >>42142971 #

1. fsndz ◴[15 Nov 24 00:56 UTC] No.42142971[source]▶

>>42142922 #

PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish. It is not even close.

"Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"

https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
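Side note on the quoted output: both ratings still read 1500.00 after a 6-0 result, which would not be the case if a standard Elo update were applied after each game. A minimal sketch of that update follows, for reference only. The K-factor of 32 and the function names are assumptions for illustration; how the leaderboard actually computes ratings is not stated here.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return both new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not taken from the leaderboard.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * (exp_a - score_a)
    return new_a, new_b


# Six straight losses for the LLM, both sides starting at 1500.
llm, engine = 1500.0, 1500.0
for _ in range(6):
    llm, engine = update_elo(llm, engine, score_a=0.0)
```

Under these assumptions the two ratings move apart after the first game and keep diverging, so identical final ratings suggest the update step did not run.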

replies(3): >>42143260 #>>42143295 #>>42145596 #

2. janalsncm ◴[15 Nov 24 01:54 UTC] No.42143260[source]▶

>>42142971 (TP) #

> I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting

I think the author was comparing against Stockfish at a lower skill level (roughly, the number of nodes explored in a move).

replies(1): >>42143574 #

3. ◴[15 Nov 24 02:00 UTC] No.42143295[source]▶

>>42142971 (TP) #

4. fsndz ◴[15 Nov 24 02:53 UTC] No.42143574[source]▶

>>42143260 #

Did the same and gpt-3.5-turbo-instruct still lost all the games. Maybe a difference in Stockfish version? I am using Stockfish 16.

replies(1): >>42149947 #

5. tedsanders ◴[15 Nov 24 10:33 UTC] No.42145596[source]▶

>>42142971 (TP) #

Your issue is that the performance of these models at chess is incredibly sensitive to the prompt. If you have gpt-3.5-turbo-instruct complete a PGN transcript, then you'll see performance in the 1800 Elo range. If you ask in English or diagram the board, you'll see vastly degraded performance.

Unlike people, how you ask the question really really affects the output quality.
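To make "complete a PGN transcript" concrete, here is a rough sketch of how such a prompt could be built from the moves played so far. The header tags, player names, and the helper name `pgn_prompt` are all placeholders for illustration; this is not the exact prompt used in any experiment mentioned in this thread.

```python
def pgn_prompt(san_moves, white="Player A", black="Player B"):
    """Format the moves so far as a PGN transcript for a completion model.

    san_moves is a list of moves in standard algebraic notation,
    alternating White, Black, White, ...
    The header values are hypothetical placeholders.
    """
    headers = [
        '[Event "Example game"]',
        f'[White "{white}"]',
        f'[Black "{black}"]',
        '[Result "*"]',
    ]
    parts = []
    for i, move in enumerate(san_moves):
        if i % 2 == 0:
            # White's move: prefix it with the move number.
            parts.append(f"{i // 2 + 1}.")
        parts.append(move)
    if len(san_moves) % 2 == 0:
        # White is to move next: end on the next move number so the
        # model's natural continuation is White's move.
        parts.append(f"{len(san_moves) // 2 + 1}.")
    return "\n".join(headers) + "\n\n" + " ".join(parts)


prompt = pgn_prompt(["e4", "e5", "Nf3", "Nc6"])
```

The string ends with the movetext `1. e4 e5 2. Nf3 Nc6 3.`, so the most likely continuation is a move in algebraic notation rather than a conversational answer.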

6. janalsncm ◴[15 Nov 24 19:19 UTC] No.42149947{3}[source]▶

>>42143574 #

Huh. Honestly, your answer makes more sense: LLMs shouldn’t be good at chess, and this anomaly looks more like a bug. Maybe the author should share his code so it can be replicated.
