    695 points crescit_eundo | 16 comments
    niobe ◴[] No.42142885[source]
    I don't understand why educated people expect that an LLM would be able to play chess at a decent level.

    It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and the deterministic computation that chess clearly requires.

    replies(20): >>42142963 #>>42143021 #>>42143024 #>>42143060 #>>42143136 #>>42143208 #>>42143253 #>>42143349 #>>42143949 #>>42144041 #>>42144146 #>>42144448 #>>42144487 #>>42144490 #>>42144558 #>>42144621 #>>42145171 #>>42145383 #>>42146513 #>>42147230 #
    1. computerex ◴[] No.42142963[source]
    Question here is why gpt-3.5-instruct can then beat stockfish.
    replies(4): >>42142975 #>>42143081 #>>42143181 #>>42143889 #
    2. fsndz ◴[] No.42142975[source]
    PS: I ran it, and as suspected gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close. "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00" https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
    replies(1): >>42142993 #
    3. computerex ◴[] No.42142993[source]
    Maybe there's some difference in the setup, because the OP reports that the model beats Stockfish (as they had it configured) every single game.
    replies(2): >>42143059 #>>42144502 #
    4. Filligree ◴[] No.42143059{3}[source]
    OP had stockfish at its weakest preset.
    replies(1): >>42143193 #
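    If "weakest preset" means the engine's built-in handicap, Stockfish exposes that through the UCI option "Skill Level" (range 0–20, default 20). The thread doesn't say exactly how the OP configured the engine, so treat this as one plausible sketch of the weakest setting:

```python
def stockfish_weakest_uci():
    """Build the UCI command sequence that puts Stockfish on its weakest
    built-in setting: Skill Level 0 (the range is 0-20, and the default,
    20, is full strength)."""
    return [
        "uci",                                 # start the UCI handshake
        "setoption name Skill Level value 0",  # weakest handicap level
        "isready",                             # wait until the option is applied
    ]
```

    These strings would be written to the engine's stdin by whatever harness drives the match; lowering Skill Level makes Stockfish deliberately pick weaker moves rather than merely search less.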
    5. bluGill ◴[] No.42143081[source]
    The article appears to have run Stockfish only at low levels; you don't have to be very good to beat it.
    6. lukan ◴[] No.42143181[source]
    Cheating (using an internal chess engine) would be the obvious reason to me.
    replies(2): >>42143214 #>>42165535 #
    7. fsndz ◴[] No.42143193{4}[source]
    I did the same and gpt-3.5-turbo-instruct still lost all the games. Maybe a difference in Stockfish version? I am using Stockfish 16.
    replies(1): >>42143999 #
    8. TZubiri ◴[] No.42143214[source]
    Nope. API calls don't use function calls.
    replies(2): >>42143226 #>>42144027 #
    9. permo-w ◴[] No.42143226{3}[source]
    that you know of
    replies(1): >>42150883 #
    10. shric ◴[] No.42143889[source]
    I'm actually surprised any of them manage to make legal moves throughout the game once they're out of book.
    11. mannykannot ◴[] No.42143999{5}[source]
    That is a very pertinent question, especially if Stockfish has been used to generate training data.
    12. girvo ◴[] No.42144027{3}[source]
    How can you prove this when talking about someone's internal closed API?
    13. golol ◴[] No.42144502{3}[source]
    You have to get the model to think in PGN data. It's crucial to use the exact PGN format it saw in its training data and to give it few-shot examples.
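    To make the PGN-framing idea concrete, here is a hypothetical sketch of a prompt builder: the header tags and player names are invented for illustration, and the point is simply that the model is asked to continue a standard PGN transcript rather than answer a conversational "play chess with me" request.

```python
def build_pgn_prompt(moves_so_far):
    """Sketch: format a game in progress as a PGN transcript so that a
    completion model predicts the next move as a continuation of the text.
    `moves_so_far` is a list of SAN moves, e.g. ["e4", "e5", "Nf3"].
    Header tag values here are placeholders, not from the thread."""
    header = (
        '[Event "Casual Game"]\n'
        '[White "White"]\n'
        '[Black "Black"]\n'
        '[Result "*"]\n\n'
    )
    body = ""
    for i in range(0, len(moves_so_far), 2):
        body += f"{i // 2 + 1}. {moves_so_far[i]}"   # move number + White's move
        if i + 1 < len(moves_so_far):
            body += f" {moves_so_far[i + 1]}"        # Black's reply, if played
        body += " "
    # Trailing space leaves the transcript mid-game, inviting a continuation.
    return header + body
```

    A few complete short games in the same format could be prepended as few-shot examples, per the comment above.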
    14. TZubiri ◴[] No.42150883{4}[source]
    Sure. It's not hard to verify: in the user-facing UI, function calls are very transparent.

    And in the API, all of the common features like math and search are just not there; you can implement them yourself.

    You can compare with self-hosted models like Llama, and the performance is quite similar.

    You can also jailbreak and get a shell into the container for some further proof.

    replies(1): >>42157065 #
    15. permo-w ◴[] No.42157065{5}[source]
    This is all just guesswork. It's a black box; you have no idea what post-processing they're doing on their end.
    16. nske ◴[] No.42165535[source]
    But in that case there shouldn't be any invalid moves, ever. Another tester found gpt-3.5-turbo-instruct suggesting at least one illegal move in 16% of games (source: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/ )
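    For clarity, the 16% figure in the linked post is a per-game rate: the fraction of games containing at least one illegal suggestion, not the fraction of individual moves. A tiny sketch of that metric (the per-move legality judgments themselves would come from a rules engine such as python-chess, omitted here):

```python
def illegal_game_rate(games_moves_legal):
    """Fraction of games in which the model proposed at least one illegal
    move. Each inner list holds one boolean per suggested move
    (True = legal), as judged by some external rules engine."""
    if not games_moves_legal:
        return 0.0
    bad = sum(1 for game in games_moves_legal if not all(game))
    return bad / len(games_moves_legal)
```

    A single illegal move anywhere in a game flags the whole game, which is why this rate can look high even when most individual moves are legal.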