688 points crescit_eundo | 6 comments
fsndz ◴[] No.42142922[source]
Wow, I actually did something similar recently. No LLM could win, and the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it: https://www.lycee.ai/blog/what-happens-when-llms-play-chess

I am very surprised by the performance of gpt-3.5-turbo-instruct. Beating Stockfish? I will have to run the experiment with that model to check.
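For anyone unfamiliar with the centipawn loss mentioned above: it is usually taken as the drop in engine evaluation between the best available move and the move actually played, averaged over a game. Here is a minimal sketch of that arithmetic, assuming you already have per-move evaluations in centipawns from the mover's point of view. The function name and the sample numbers are made up for illustration and are not taken from the linked leaderboard.

```python
from typing import Sequence


def average_centipawn_loss(best_evals: Sequence[int], played_evals: Sequence[int]) -> float:
    """Mean of max(0, best - played) over all moves, in centipawns."""
    if len(best_evals) != len(played_evals):
        raise ValueError("need one best-move and one played-move evaluation per move")
    if not best_evals:
        return 0.0
    # A move cannot score better than the engine's best move, so clamp at zero
    # to absorb small evaluation noise.
    losses = [max(0, best - played) for best, played in zip(best_evals, played_evals)]
    return sum(losses) / len(losses)


# Hypothetical evaluations for four moves (illustrative numbers only).
best = [30, 25, 40, 10]
played = [30, 5, -260, 10]
acl = average_centipawn_loss(best, played)
print(acl)
```

A single blunder (the third move here) dominates the average, which is why a handful of bad moves is enough to send the number "through the roof".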

replies(1): >>42142971 #
1. fsndz ◴[] No.42142971[source]
PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish. It is not even close.

"Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"

https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
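Side note on the quoted output: both ratings still read 1500.00 after a 6-0 sweep, which is what you would see if no rating update had been applied yet. How the script computes ratings is not shown here, so the following is only a sketch of a standard Elo update for comparison. The K-factor of 32 and the function names are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b); score_a is 1 for a win, 0.5 draw, 0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Replay the quoted result: the LLM loses six games in a row.
llm, engine = 1500.0, 1500.0
for _ in range(6):
    llm, engine = update_elo(llm, engine, score_a=0.0)
print(round(llm, 2), round(engine, 2))
```

With any positive K the two ratings diverge after the first game, so identical final ratings after six decisive games suggest the update step was skipped or not printed.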

replies(3): >>42143260 #>>42143295 #>>42145596 #
2. janalsncm ◴[] No.42143260[source]
> I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting

I think the author was comparing against Stockfish at a lower skill level (roughly, the number of nodes explored in a move).
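For reference, here are two common ways to weaken Stockfish over the UCI protocol: the engine's `Skill Level` option (0 to 20) and a per-move node limit via `go nodes`. Which one the author used is not stated, so this is only a sketch that builds the command strings. The helper names are made up, and sending the commands to an engine process is left out.

```python
from typing import List, Optional


def stockfish_setup_commands(skill_level: Optional[int] = None) -> List[str]:
    """UCI commands to initialise the engine, optionally lowering Skill Level."""
    commands = ["uci"]
    if skill_level is not None:
        if not 0 <= skill_level <= 20:
            raise ValueError("Stockfish Skill Level must be between 0 and 20")
        commands.append(f"setoption name Skill Level value {skill_level}")
    commands.append("isready")
    return commands


def go_command(nodes: Optional[int] = None, movetime_ms: Optional[int] = None) -> str:
    """Build a 'go' command limited by node count and/or time per move."""
    if nodes is None and movetime_ms is None:
        raise ValueError("give at least one search limit")
    parts = ["go"]
    if nodes is not None:
        parts += ["nodes", str(nodes)]
    if movetime_ms is not None:
        parts += ["movetime", str(movetime_ms)]
    return " ".join(parts)


print(stockfish_setup_commands(skill_level=0))
print(go_command(nodes=1000))
```

The two knobs are independent, so two experiments that both say "lowest difficulty" can still be facing engines of quite different strength.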

replies(1): >>42143574 #
3. ◴[] No.42143295[source]
4. fsndz ◴[] No.42143574[source]
I did the same and gpt-3.5-turbo-instruct still lost all the games. Maybe a difference in Stockfish version? I am using Stockfish 16.
replies(1): >>42149947 #
5. tedsanders ◴[] No.42145596[source]
Your issue is that the performance of these models at chess is incredibly sensitive to the prompt. If you have gpt-3.5-turbo-instruct complete a PGN transcript, then you'll see performance in the 1800 Elo range. If you ask in English or diagram the board, you'll see vastly degraded performance.

Unlike with people, how you ask the question really, really affects the output quality.
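To make "complete a PGN transcript" concrete, here is a rough sketch of how such a prompt could be assembled: PGN header tags, then the movetext so far, ending where the next move is expected so the model's continuation is the move. The helper name and header values are placeholders, not a known-good prompt, and the call to the completion API is omitted.

```python
from typing import Dict, List


def build_pgn_prompt(san_moves: List[str], headers: Dict[str, str]) -> str:
    """Format headers and moves as PGN, ending where the next move should go."""
    header_lines = [f'[{key} "{value}"]' for key, value in headers.items()]
    tokens = []
    for ply, san in enumerate(san_moves):
        if ply % 2 == 0:
            # White's move: prefix with the move number, e.g. "1."
            tokens.append(f"{ply // 2 + 1}.")
        tokens.append(san)
    if len(san_moves) % 2 == 0:
        # White is to move next, so end on the upcoming move number.
        tokens.append(f"{len(san_moves) // 2 + 1}.")
    return "\n".join(header_lines) + "\n\n" + " ".join(tokens)


# Placeholder headers and a short opening, for illustration only.
headers = {"Event": "Example game", "White": "Player A", "Black": "Player B", "Result": "*"}
prompt = build_pgn_prompt(["e4", "e5", "Nf3", "Nc6"], headers)
print(prompt)
```

The model is then asked for a short completion, and the first token sequence that parses as a legal SAN move is played.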

6. janalsncm ◴[] No.42149947{3}[source]
Huh. Honestly, your answer makes more sense: LLMs shouldn’t be good at chess, and this anomaly looks more like a bug. Maybe the author should share his code so it can be replicated.