    688 points crescit_eundo | 12 comments
    1. snickerbockers ◴[] No.42144943[source]
    Does it ever try an illegal move? OP didn't mention this, and I think it's inevitable that it happens at least once, since the rules of chess are fairly arbitrary and LLMs are notorious for bullshitting their way through difficult problems when we'd rather they just admit that they don't have the answer.
    replies(2): >>42145004 #>>42145793 #
    2. sethherr ◴[] No.42145004[source]
    Yes, he discusses using a grammar to restrict to only legal moves
    replies(4): >>42147380 #>>42148708 #>>42150800 #>>42152205 #
    3. smatija ◴[] No.42145793[source]
    In my experience you are lucky if it manages to give you 10 legal moves in a row, e.g. https://news.ycombinator.com/item?id=41527143#41529024
    4. topaz0 ◴[] No.42147380[source]
    Still an interesting direction of questioning. Maybe it could be rephrased as "how much work is the grammar doing?" Are the results with the grammar very different from those without? If/when a grammar is not used (as in the OpenAI case), how many illegal moves does it try on average before finding a legal one?
    replies(3): >>42147422 #>>42150017 #>>42151815 #
    5. Jerrrrrrry ◴[] No.42147422{3}[source]
    An LLM would complain that its internal model does not reflect its current input/output.

    Since LLMs know that people make knockoffs, run tests, run afoul of rules, and make mistakes, it would raise that as a possibility and likely inquire.

    replies(1): >>42149652 #
    6. ◴[] No.42148708[source]
    7. causal ◴[] No.42149652{4}[source]
    This isn't prompt engineering, it's grammar-constrained decoding. It literally cannot respond with anything but tokens that fulfill the grammar.
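    To make that concrete, here is a minimal Python sketch of the idea (not the setup from the post): the "grammar" is reduced to a whitelist of legal moves computed with python-chess, and everything outside it is masked out before sampling. The model_scores dict is a hypothetical stand-in for the model's output distribution.

        import random
        import chess  # pip install python-chess

        def constrained_move(board, model_scores):
            # model_scores: hypothetical {uci_move: nonnegative weight} from the model.
            legal = {m.uci() for m in board.legal_moves}
            # The "grammar" step: hard-mask everything that is not a legal move.
            allowed = {mv: w for mv, w in model_scores.items() if mv in legal}
            if not allowed:
                # The model put all its mass on illegal moves; fall back to uniform.
                return random.choice(sorted(legal))
            # Sample from the renormalized remainder.
            r = random.uniform(0, sum(allowed.values()))
            for mv, w in allowed.items():
                r -= w
                if r <= 0:
                    return mv
            return mv

        board = chess.Board()
        # "e2e5" is never legal from the start, so it can never be emitted.
        print(constrained_move(board, {"e2e4": 5.0, "e2e5": 3.0, "d2d4": 2.0}))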
    8. int_19h ◴[] No.42150017{3}[source]
    A grammar is really just a special case of the more general issue of how to pick a single token given the probabilities that the model spits out for every possible one. In that sense, filters like temperature / top_p / top_k are already hacks that "do the work" (since always taking the most likely predicted token does not give good results in practice), and grammars are just a more complicated way to make such decisions.
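    A toy numpy sketch of that view: temperature, top-k, and a grammar mask are all just different transformations applied to the same logit vector before the final draw (the four logits here are made up).

        import numpy as np

        def sample(logits, temperature=1.0, top_k=None, mask=None):
            logits = np.asarray(logits, dtype=float)
            if mask is not None:
                logits[~np.asarray(mask)] = -np.inf   # grammar: hard-exclude tokens
            logits = logits / temperature             # temperature: reshape the curve
            if top_k is not None:
                cutoff = np.sort(logits)[-top_k]      # top-k: drop all but the k best
                logits[logits < cutoff] = -np.inf
            probs = np.exp(logits - logits[np.isfinite(logits)].max())
            probs /= probs.sum()
            return int(np.random.choice(len(probs), p=probs))

        # Four hypothetical tokens; the third is forbidden by the "grammar".
        print(sample([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=3,
                     mask=[True, True, False, True]))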
    9. yshui ◴[] No.42150800[source]
    I suspect the models probably memorized some chess openings, and afterwards they are just playing random moves with the help of the grammar.
    replies(1): >>42151787 #
    10. gs17 ◴[] No.42151787{3}[source]
    I suspect that as well; however, 3.5-turbo-instruct has been noted by other people to do much better at generating legal chess moves than the other models. https://github.com/adamkarvonen/chess_gpt_eval gave models "5 illegal moves before forced resignation of the round", and 3.5 made very few illegal moves, while 4 lost most games due to illegal moves.
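    The retry rule described there amounts to roughly this loop (a sketch, not the actual chess_gpt_eval code; propose_move is a hypothetical hook into whatever queries the model):

        import chess

        MAX_ILLEGAL = 5

        def play_model_move(board, propose_move):
            for _ in range(MAX_ILLEGAL):
                try:
                    move = chess.Move.from_uci(propose_move(board))
                except ValueError:
                    continue  # not even parseable; counts as an illegal attempt
                if move in board.legal_moves:
                    board.push(move)
                    return True  # move played
                # Legal-looking string, illegal in this position; try again.
            return False  # five strikes: forced resignation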
    11. gs17 ◴[] No.42151815{3}[source]
    I'd be more interested in what the distribution of grammar-restricted predictions looks like compared to moves Stockfish says are good.
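    One way to sketch such a comparison, assuming a local Stockfish binary on the path and a model_probs dict standing in for the grammar-restricted distribution: pull Stockfish's top lines via MultiPV and line them up against the model's probabilities.

        import chess
        import chess.engine

        def compare(board, model_probs, stockfish_path="stockfish"):
            with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
                lines = engine.analyse(board, chess.engine.Limit(depth=15),
                                       multipv=len(model_probs))
                for line in lines:
                    mv = line["pv"][0].uci()
                    print(mv, line["score"].relative,
                          "model_p=%.3f" % model_probs.get(mv, 0.0))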
    12. thaumasiotes ◴[] No.42152205[source]
    > he discusses using a grammar to restrict to only legal moves

    Whether a chess move is legal isn't primarily a question of grammar. It's a question of the board state. "White king to a5" is a perfectly legal move, as long as the white king was next to a5 before the move, it's white's turn, there isn't a white piece on a5, and a5 isn't threatened by black. Otherwise it isn't.

    "White king to a9" is a move that could be recognized and blocked by a grammar, but how relevant is that?