(dynomight.substack.com)

696 points crescit_eundo | 1 comments | 14 Nov 24 17:05 UTC

fsndz ◴[15 Nov 24 00:46 UTC] No.42142922[source]▶

Wow, I actually did something similar recently: no LLM could win, and the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it: https://www.lycee.ai/blog/what-happens-when-llms-play-chess

I am very surprised by the performance of gpt-3.5-turbo-instruct. Beating Stockfish? I will have to run the experiment with that model to check.
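For readers unfamiliar with the metric mentioned in this comment: centipawn loss is usually described as the gap, in hundredths of a pawn, between the engine's evaluation of its preferred move and its evaluation of the move actually played, averaged over a game. The sketch below is a minimal illustration under that common definition. The function name and the input format (two parallel lists of evaluations from the mover's point of view) are assumptions made for illustration, not the leaderboard's actual code.

```python
from typing import Sequence


def average_centipawn_loss(best_evals: Sequence[int],
                           played_evals: Sequence[int]) -> float:
    """Average per-move centipawn loss.

    best_evals[i]   -- engine eval (centipawns) after the engine's best move
    played_evals[i] -- engine eval (centipawns) after the move actually played
    Both are taken from the point of view of the side that moved.
    (Hypothetical input format, chosen for illustration.)
    """
    if len(best_evals) != len(played_evals):
        raise ValueError("both lists must cover the same moves")
    if not best_evals:
        return 0.0
    # A played move cannot be "better than best" under this definition,
    # so negative differences (search noise) are floored at zero.
    losses = [max(0, best - played)
              for best, played in zip(best_evals, played_evals)]
    return sum(losses) / len(losses)


# Three moves: two match the engine's choice, one drops about two pawns.
acpl = average_centipawn_loss([30, 50, 10], [30, -150, 10])
```

A single blunder dominates the average, which is one plausible reason the figure "goes through the roof" for models that occasionally play losing moves.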

replies(1): >>42142971 #

fsndz ◴[15 Nov 24 00:56 UTC] No.42142971[source]▶

>>42142922 #

PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close.

"Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"

https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
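One detail in the quoted output: both ratings still read 1500.00 after a 6-0 score. Under a standard Elo update the two numbers would move apart after the first decisive game. The sketch below shows that update, assuming a K-factor of 32 and a shared starting rating of 1500. Both values and all function names are assumptions for illustration; the rating code behind the quoted output is not shown in the thread.

```python
from typing import Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return new (A, B) ratings after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k=32 is an assumed K-factor, not taken from the source.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def replay(scores_for_a: Iterable[float], start: float = 1500.0,
           k: float = 32.0) -> Tuple[float, float]:
    """Apply a sequence of game results starting from equal ratings."""
    a = b = start
    for score in scores_for_a:
        a, b = update_elo(a, b, score, k)
    return a, b


# Six straight losses for player A (the LLM in the quoted results).
llm_rating, engine_rating = replay([0.0] * 6)
```

With these assumptions, the first loss alone moves the ratings to 1484 and 1516, so unchanged ratings would suggest the displayed values were never updated or are reported separately from the match loop.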

replies(3): >>42143260 #>>42143295 #>>42145596 #

janalsncm ◴[15 Nov 24 01:54 UTC] No.42143260[source]▶

>>42142971 #

> I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting

I think the author was comparing against Stockfish at a lower skill level (roughly, the number of nodes explored in a move).
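For anyone trying to replicate this: Stockfish speaks the UCI protocol, where strength can be reduced in at least two separate ways, its `Skill Level` option (0 to 20) and a per-move node limit passed to `go`. The sketch below only builds the command strings one would send to the engine. The helper name is hypothetical, and which of these settings the article's author used is not stated in this thread.

```python
from typing import List, Optional


def uci_handicap_commands(skill_level: Optional[int] = None,
                          nodes: Optional[int] = None) -> List[str]:
    """Build UCI commands that weaken an engine such as Stockfish.

    skill_level -- value for Stockfish's "Skill Level" option (0..20)
    nodes       -- cap on nodes searched for the next move
    (Hypothetical helper; it only formats strings, it does not
    start or talk to an engine process.)
    """
    commands: List[str] = []
    if skill_level is not None:
        if not 0 <= skill_level <= 20:
            raise ValueError("Stockfish Skill Level ranges from 0 to 20")
        commands.append(f"setoption name Skill Level value {skill_level}")
    if nodes is not None:
        if nodes <= 0:
            raise ValueError("nodes must be positive")
        # "go nodes N" asks the engine to stop after searching N nodes.
        commands.append(f"go nodes {nodes}")
    return commands


# Lowest skill level combined with a small node budget per move.
cmds = uci_handicap_commands(skill_level=0, nodes=1000)
```

Recording the exact settings alongside the engine version makes it much easier to compare results across experiments like the ones discussed here.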

replies(1): >>42143574 #

fsndz ◴[15 Nov 24 02:53 UTC] No.42143574{3}[source]▶

>>42143260 #

I did the same and gpt-3.5-turbo-instruct still lost all the games. Maybe a difference in Stockfish version? I am using Stockfish 16.

replies(1): >>42149947 #

1. janalsncm ◴[15 Nov 24 19:19 UTC] No.42149947{4}[source]▶

>>42143574 #

Huh. Honestly, your answer makes more sense: LLMs shouldn’t be good at chess, and this anomaly looks more like a bug. Maybe the author should share his code so it can be replicated.

Something weird is happening with LLMs and chess