(dynomight.substack.com)

696 points crescit_eundo | 2 comments | 14 Nov 24 17:05 UTC | HN request time: 0s | source

niobe ◴[15 Nov 24 00:40 UTC] No.42142885[source]▶

I don't understand why educated people expect that an LLM would be able to play chess at a decent level.

It has no idea about the quality of its data. "Act like x" prompts are no substitute for the actual reasoning and deterministic computation that chess clearly requires.

replies(20): >>42142963 #>>42143021 #>>42143024 #>>42143060 #>>42143136 #>>42143208 #>>42143253 #>>42143349 #>>42143949 #>>42144041 #>>42144146 #>>42144448 #>>42144487 #>>42144490 #>>42144558 #>>42144621 #>>42145171 #>>42145383 #>>42146513 #>>42147230 #

computerex ◴[15 Nov 24 00:55 UTC] No.42142963[source]▶

>>42142885 #

The question, then, is why gpt-3.5-instruct can beat Stockfish.

replies(4): >>42142975 #>>42143081 #>>42143181 #>>42143889 #

fsndz ◴[15 Nov 24 00:57 UTC] No.42142975[source]▶

>>42142963 #

PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close. "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00" https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
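A side note on the quoted numbers: both ratings read 1500.00 after a 6-0 result. Under a standard Elo update, a sweep like that would move the two ratings apart, so the reported values look like starting ratings that were never adjusted. The sketch below shows that arithmetic. It is only an illustration: the K-factor of 32, the 1500 starting rating used as a default, and the function names are assumptions, and nothing here is taken from the script that produced the results above.

```python
# Minimal Elo sketch. K=32 and all function names are illustrative
# assumptions; this is not the rating code used in the run quoted above.

def expected_score(rating_a, rating_b):
    """Expected score of player A against player B (standard Elo formula)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return both new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

def replay(scores_for_a, start=1500.0, k=32.0):
    """Apply a sequence of game results, both players starting at `start`."""
    a = b = start
    for score in scores_for_a:
        a, b = update_elo(a, b, score, k)
    return a, b

if __name__ == "__main__":
    # Six straight wins for the engine, as in the quoted results.
    engine_rating, model_rating = replay([1.0] * 6)
    print(f"engine: {engine_rating:.2f}, model: {model_rating:.2f}")
```

With these assumptions the winner ends above 1500 and the loser below it by the same amount, since each update adds to one rating exactly what it subtracts from the other.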

replies(1): >>42142993 #

computerex ◴[15 Nov 24 01:00 UTC] No.42142993[source]▶

>>42142975 #

Maybe there's some difference in the setup, because the OP reports that the model beats Stockfish (as they had it configured) in every single game.

replies(2): >>42143059 #>>42144502 #

Filligree ◴[15 Nov 24 01:16 UTC] No.42143059[source]▶

>>42142993 #

OP had stockfish at its weakest preset.

replies(1): >>42143193 #

1. fsndz ◴[15 Nov 24 01:42 UTC] No.42143193[source]▶

>>42143059 #

Did the same, and gpt-3.5-turbo-instruct still lost all the games. Maybe a difference in Stockfish version? I am using Stockfish 16.

replies(1): >>42143999 #

2. mannykannot ◴[15 Nov 24 04:35 UTC] No.42143999[source]▶

>>42143193 (TP) #

That is a very pertinent question, especially if Stockfish has been used to generate training data.

Something weird is happening with LLMs and chess