(dynomight.substack.com)

niobe ◴[15 Nov 24 00:40 UTC] No.42142885[source]▶

I don't understand why educated people expect that an LLM would be able to play chess at a decent level.

It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.

replies(20): >>42142963 #>>42143021 #>>42143024 #>>42143060 #>>42143136 #>>42143208 #>>42143253 #>>42143349 #>>42143949 #>>42144041 #>>42144146 #>>42144448 #>>42144487 #>>42144490 #>>42144558 #>>42144621 #>>42145171 #>>42145383 #>>42146513 #>>42147230 #

computerex ◴[15 Nov 24 00:55 UTC] No.42142963[source]▶

>>42142885 #

The question, then, is why gpt-3.5-instruct can beat Stockfish.

replies(4): >>42142975 #>>42143081 #>>42143181 #>>42143889 #

1. lukan ◴[15 Nov 24 01:40 UTC] No.42143181[source]▶

>>42142963 #

Cheating (using an internal chess engine) would be the obvious reason to me.

replies(2): >>42143214 #>>42165535 #

2. TZubiri ◴[15 Nov 24 01:46 UTC] No.42143214[source]▶

>>42143181 (TP) #

Nope. API calls don't use function calls.

replies(2): >>42143226 #>>42144027 #

3. permo-w ◴[15 Nov 24 01:48 UTC] No.42143226[source]▶

>>42143214 #

that you know of

replies(1): >>42150883 #

4. girvo ◴[15 Nov 24 04:42 UTC] No.42144027[source]▶

>>42143214 #

How can you prove this when talking about someone's internal closed API?

5. TZubiri ◴[15 Nov 24 20:52 UTC] No.42150883{3}[source]▶

>>42143226 #

Sure. It's not hard to verify: in the user UI, function calls are very transparent.

And in the API, all of the common features like maths and search are just not there. You can implement them yourself.

You can compare with self-hosted models like Llama, and the performance is quite similar.

You can also jailbreak and get a shell in the container for some further proof.

replies(1): >>42157065 #

6. permo-w ◴[16 Nov 24 16:09 UTC] No.42157065{4}[source]▶

>>42150883 #

this is all just guesswork. it's a black box. you have no idea what post-processing they're doing on their end

7. nske ◴[17 Nov 24 17:40 UTC] No.42165535[source]▶

>>42143181 (TP) #

But in that case there shouldn't be any invalid moves, ever. Another tester found that gpt-3.5-turbo-instruct suggested at least one illegal move in 16% of the games (source: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/).
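
To make that metric concrete, here is a minimal sketch of how a per-game illegal-move rate could be counted. The function name and the legality flags are hypothetical and made up for illustration, not taken from the linked post; it also assumes each move was already checked by some external validator. The point is that a game counts once no matter how many illegal moves it contains, and that an engine-backed player should score exactly 0.

```python
from typing import List


def share_of_games_with_illegal_move(games: List[List[bool]]) -> float:
    """Return the fraction of games containing at least one illegal move.

    Each game is a list of per-move flags, where True means the move was legal.
    """
    if not games:
        raise ValueError("need at least one game")
    # A game is flagged as soon as any single move in it is illegal.
    flagged = sum(1 for moves in games if not all(moves))
    return flagged / len(games)


# Made-up flags for illustration only; NOT data from the linked post.
example_games = [
    [True, True, True],
    [True, False, True],   # one illegal move: the whole game is flagged
    [True, True],
    [True, True, True, True],
]

rate = share_of_games_with_illegal_move(example_games)
print(f"{rate:.0%} of games contain at least one illegal move")
```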

Something weird is happening with LLMs and chess