
688 points crescit_eundo | 10 comments
niobe ◴[] No.42142885[source]
I don't understand why educated people expect that an LLM would be able to play chess at a decent level.

It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.

replies(20): >>42142963 #>>42143021 #>>42143024 #>>42143060 #>>42143136 #>>42143208 #>>42143253 #>>42143349 #>>42143949 #>>42144041 #>>42144146 #>>42144448 #>>42144487 #>>42144490 #>>42144558 #>>42144621 #>>42145171 #>>42145383 #>>42146513 #>>42147230 #
1. xelxebar ◴[] No.42143949[source]
Then you should be surprised that turbo-instruct actually plays well, right? We see a proliferation of hand-wavy arguments based on unfounded anthropomorphic intuitions about "actual reasoning" and whatnot. I think this is good evidence that nobody really understands what's going on.

If some mental model says that LLMs should be bad at chess, then it fails to explain why we have LLMs playing strong chess. If another mental model says the inverse, then it fails to explain why so many of these large models fail spectacularly at chess.

Clearly, there's more going on here.

replies(5): >>42144358 #>>42145060 #>>42147213 #>>42147766 #>>42161043 #
2. flyingcircus3 ◴[] No.42144358[source]
"playing strong chess" would be a much less hand-wavy claim if there were lots of independent methods of quantifying and verifying the strength of stockfish's lowest difficulty setting. I honestly don't know if that exists or not. But unless it does, why would stockfish's lowest difficulty setting be a meaningful threshold?
replies(1): >>42144495 #
3. golol ◴[] No.42144495[source]
I've tried it myself; GPT-3.5-turbo-instruct was at least somewhere in the 1600-1800 Elo range.
4. akira2501 ◴[] No.42145060[source]
There are some who suggest that modern chess is mostly a game of memorization and not particularly one of strategy or skill. I assume this is why variants like speed chess exist.

In this scope, my mental model is that LLMs would be good at modern-style long-form chess, but would likely be easy to trip up with certain types of move combinations that most humans would not normally use. My prediction is that, once such combinations were found, the models would be comically susceptible to them.

Clearly, we have no real basis for saying it is "good" or "bad" at chess, and even using chess performance as a measurement sample is a highly biased decision, likely born out of marketing rather than principle.

replies(2): >>42145616 #>>42145630 #
5. mewpmewp2 ◴[] No.42145616[source]
It is memorisation only after you have grandmastered reasoning and strategy.
6. DiogenesKynikos ◴[] No.42145630[source]
Speed chess relies on skill.

I think you're using "skill" to refer solely to one aspect of chess skill: the ability to do brute-force calculations of sequences of upcoming moves. There are other aspects of chess skill, such as:

1. The ability to judge a chess position at a glance, based on years of experience in playing chess and theoretical knowledge about chess positions.

2. The ability to instantly spot tactics in a position.

In blitz (about 5 minutes) or bullet (1 minute) chess games, these other skills are much more important than the ability to calculate deep lines. They're still aspects of chess skill, and they're probably just as important as the ability to do long brute-force calculations.

replies(1): >>42147429 #
7. the_af ◴[] No.42147213[source]
> Then you should be surprised that turbo-instruct actually plays well, right?

Do we know it's not special-casing chess and instead using a different engine (not an LLM) for playing?

To be clear, this would be an entirely appropriate approach to problem-solving in the real world; it just wouldn't be the LLM that's playing chess.
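
Purely as an illustration of what "special-casing" could mean (hypothetical; there's no evidence any provider actually does this), the routing is trivial to sketch: if the prompt parses as a chess game, hand it to an engine instead of the model. Here query_llm is a stand-in for the ordinary model call, and the detection uses python-chess.

    import chess
    import chess.engine

    def query_llm(prompt):
        # Hypothetical placeholder for the normal language-model call.
        raise NotImplementedError

    def parse_as_chess(prompt):
        # Try to replay the prompt as SAN moves; return the board if it works.
        tokens = [t for t in prompt.split() if not t.rstrip(".").isdigit()]
        if len(tokens) < 2:
            return None
        board = chess.Board()
        try:
            for token in tokens:
                board.push_san(token)
        except ValueError:               # illegal or non-chess token
            return None
        return board

    def answer(prompt):
        board = parse_as_chess(prompt)
        if board is not None:
            engine = chess.engine.SimpleEngine.popen_uci("stockfish")
            try:
                move = engine.play(board, chess.engine.Limit(time=0.1)).move
                return board.san(move)   # the "LLM's" move actually comes from Stockfish
            finally:
                engine.quit()
        return query_llm(prompt)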

8. henearkr ◴[] No.42147429{3}[source]
> tactics in a position

Those should show up as patterns recognizable in the game string (hence your use of the verb "spot": the grandmaster does indeed spot the patterns).

More specifically, grammar-like patterns, e.g. the same moves but translated.

Typically what an LLM can excel at.
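
To make that concrete (a toy sketch; the games list below is made up, and real PGNs would be parsed first): count move n-grams that recur across games. Verbatim repeats fall straight out of a Counter; "translated" variants of the same motif would additionally need the squares normalized before counting, but the idea is the same kind of surface regularity a next-token predictor picks up.

    from collections import Counter

    # Toy example: find move n-grams that recur across many games, i.e. the
    # string-level regularities an LLM trained on game text could latch onto.
    games = [
        ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6"],
        ["e4", "e5", "Nf3", "Nc6", "Bc4", "Bc5", "c3", "Nf6"],
        ["d4", "Nf6", "c4", "e6", "Nc3", "Bb4", "e3", "O-O"],
    ]

    def ngrams(moves, n=4):
        return [tuple(moves[i:i + n]) for i in range(len(moves) - n + 1)]

    counts = Counter(g for game in games for g in ngrams(game))
    for pattern, freq in counts.most_common():
        if freq > 1:
            print(freq, " ".join(pattern))   # e.g. 2 e4 e5 Nf3 Nc6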

9. mda ◴[] No.42147766[source]
Yes, probably there is more going on here, e.g. it is cheating.
10. niobe ◴[] No.42161043[source]
But to some approximation we do know how an LLM plays chess: based on all the games, sites, blogs, and analysis in its training data. However, it has a limited ability to tell a good move from a bad move, since the training data has both, and some of it lacks context on move quality.

Here's an experiment: give an LLM a balanced middle-game board position and ask it to "play a new move that a creative grandmaster has discovered, never before played in chess, and explain the tactics and strategy behind it". Repeat many times. Now analyse each move with an engine and look at the distribution of moves and responses. Hypothesis: it is going to come up with a bunch of moves all over the ratings map, with some sound and some fallacious arguments.
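
That experiment is straightforward to script with python-chess and a chess engine. A rough sketch (query_llm is a placeholder for whatever model API is used, the FEN is just an example position from a main-line Ruy Lopez, and the prompt wording, sample size and engine depth are arbitrary choices):

    import chess
    import chess.engine

    FEN = "r1bq1rk1/2p1bppp/p1np1n2/1p2p3/4P3/1BP2N2/PP1P1PPP/RNBQR1K1 w - - 1 9"
    PROMPT = ("Here is a FEN of a balanced position:\n" + FEN + "\n"
              "Play a new move that a creative grandmaster has discovered, never "
              "before played in chess, and explain the tactics and strategy behind it. "
              "Give the move in SAN on the first line.")

    def query_llm(prompt):
        # Hypothetical placeholder for the chat-completion call.
        raise NotImplementedError

    def sample_moves(n=50):
        evals = []
        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        try:
            for _ in range(n):
                board = chess.Board(FEN)
                reply = query_llm(PROMPT)
                san = (reply.strip().splitlines() or [""])[0].strip()
                try:
                    move = board.parse_san(san)
                except ValueError:
                    evals.append((san, None))      # illegal or unparsable suggestion
                    continue
                board.push(move)
                info = engine.analyse(board, chess.engine.Limit(depth=18))
                evals.append((san, info["score"].white().score(mate_score=10000)))
        finally:
            engine.quit()
        return evals                               # (move, centipawn eval) pairs

Plotting the centipawn loss of each suggestion against the engine's best line would give exactly the distribution the hypothesis is about.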

I really don't think there's anything too mysterious going on here. It just synthesizes existing knowledge and gives answers that include big hits, big misses and everything in between. Creators chip away at the edges to change that distribution, but the fundamental workings don't change.