(dynomight.substack.com)

696 points crescit_eundo | 2 comments | 14 Nov 24 17:05 UTC

anotherpaulg ◴[15 Nov 24 04:54 UTC] No.42144062[source]▶

>>42138289 (OP) #

I found a related set of experiments covering gpt-3.5-turbo-instruct, gpt-3.5-turbo, and gpt-4.

Same surprising conclusion: gpt-3.5-turbo-instruct is much better at chess.

https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/

replies(1): >>42144150 #

shtack ◴[15 Nov 24 05:19 UTC] No.42144150[source]▶

>>42144062 #

I’d bet it’s using function calling to hand off to a real chess engine. That could probably be tested with a timing analysis: check whether inference time scales with the number of tokens or with game complexity.
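
A minimal sketch of what that timing analysis could look like, assuming you have already recorded (completion tokens, wall-clock seconds) pairs for a batch of requests. The function name `fit_latency` and the sample numbers are hypothetical, made up for illustration; nothing here calls a real API.

```python
from statistics import mean


def fit_latency(tokens, seconds):
    """Least-squares slope (seconds per token) and Pearson r for latency vs. token count."""
    mean_t, mean_s = mean(tokens), mean(seconds)
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(tokens, seconds))
    var_t = sum((t - mean_t) ** 2 for t in tokens)
    var_s = sum((s - mean_s) ** 2 for s in seconds)
    # A flat series has no usable trend; report zero instead of dividing by zero.
    if var_t == 0 or var_s == 0:
        return 0.0, 0.0
    return cov / var_t, cov / (var_t * var_s) ** 0.5


if __name__ == "__main__":
    # Hypothetical measurements, NOT real data: output tokens and latency per request.
    sample_tokens = [5, 12, 20, 33, 41]
    sample_seconds = [0.21, 0.38, 0.60, 0.95, 1.17]
    slope, r = fit_latency(sample_tokens, sample_seconds)
    print(f"seconds per token: {slope:.4f}, correlation: {r:.3f}")
```

If latency tracks output tokens closely, that is at least consistent with plain token-by-token generation; latency that tracks position complexity instead, or shows a large fixed offset, would be more suggestive of an external call. Network jitter and server-side batching can blur either signal, so this is evidence at best, not proof.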

replies(2): >>42144275 #>>42150589 #

1. vbarrielle ◴[15 Nov 24 20:27 UTC] No.42150589[source]▶

>>42144150 #

If it were calling out to a real chess engine, there would be no illegal moves.

replies(1): >>42153733 #

2. shtack ◴[16 Nov 24 02:55 UTC] No.42153733[source]▶

>>42150589 (TP) #

Those illegal moves are likely cases where the LLM failed to call the engine, for whatever reason, and fell back to inference.

Something weird is happening with LLMs and chess