It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.
I don't really know what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.
The problem here is the specific model architecture, training data, vocabulary/tokenization method (if you were even going to represent a game this way... which you wouldn't), loss function, and probably decoding strategy... basically everything is wrong here.
Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.
> It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.
No. You can definitely train a model to be really good at chess without "actual reasoning and deterministic computation".
This is a very vague claim, but they can reconstruct the board from the list of moves, which I would say proves this wrong.
> LLMs have limited memory
For the recent models this is not a problem for the chess example. You can feed whole books into them if you want to.
> so they struggle to remember previous moves
Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.
> They’re great at explaining chess concepts or moves but not actually competing in a match.
What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
Chess moves are simply tokens like any other. Given enough chess training data, it would make sense to have part of the network trained to handle chess specifically instead of simply encoding basic lists of moves and follow-ups. The result would be a general purpose sub-network trained on chess.
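To make "tokens like any other" concrete, here's a tiny sketch (assuming the tiktoken library and its cl100k_base encoding, which is my choice of tokenizer, not anything from the post): PGN text gets chopped into ordinary subword tokens with no chess-specific handling.

    # Sketch: PGN move text becomes ordinary subword tokens.
    # Assumes the tiktoken library and the cl100k_base encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
    token_ids = enc.encode(pgn)
    print(token_ids)                              # plain integer ids
    print([enc.decode([t]) for t in token_ids])   # how the moves get split up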
In what sense is chess stateless? Question: is Rxa6 a legal move? You need board state to refer to in order to decide.
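To make that concrete, here's a small sketch using the python-chess library (my example, not from the thread): whether "Rxa6" even parses as a legal move depends entirely on the position you have reconstructed from the move list.

    # Sketch with the python-chess library: the same SAN string, "Rxa6",
    # is legal or illegal depending entirely on the board state you hold.
    import chess

    board = chess.Board()                  # starting position
    for san in ["e4", "e5", "Nf3", "Nc6"]:
        board.push_san(san)                # rebuild the position move by move

    try:
        board.parse_san("Rxa6")            # raises if the move is not legal here
        print("Rxa6 is legal in this position")
    except ValueError:
        print("Rxa6 is illegal in this position")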
https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
It is not stateless, because good chess isn't played as a series of independent moves -- it's played as a series of moves connected to a player's strategy.
> What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
Continuing from the above, "best" in the latter sense involves understanding possible future moves after the next move.
Ergo, if I looked at all games with the current board state and chose the next move that won the most games, it'd be tactically sound but strategically ignorant.
Because in many of those games, that next move was played in support of some broader strategy.
That state belongs to the player, not to the game. You can carry your own state in any game you want - for example remember who starts with what move in rock paper scissors, but that doesn't make that game stateful. It's the player's decision (or bot's implementation) to use any extra state or not.
I wrote "previous moves" specifically (and the extra bits already addressed elsewhere), but the LLM can carry/rebuild its internal state between the steps.
So even if the rules of chess are (mostly) stateless, the resulting game itself is not.
Thus, you can't dismiss concerns about LLMs having difficulty tracking state by saying that chess is stateless. It's not, in that sense.
A friend of mine just started playing chess a few weeks ago and can beat it about 25% of the time.
It will hang pieces, and if you hang your own queen there's about a 50% chance it won't be taken.
If some mental model says that LLMs should be bad at chess, then it fails to explain why we have LLMs playing strong chess. If another mental model says the inverse, then it fails to explain why so many of these large models fail spectacularly at chess.
Clearly, there's more going on here.
I mean at some level you're saying that no matter how close to 1 the win probability (1 - epsilon) gets, both of the following are true:
A. you should always expect the computation you can do via conscious reasoning alone to be sufficient, at least in principle, to asymptotically reach a higher win probability than a model, no matter what the model's win probability was to begin with
B. no matter how close to 1 the model's win rate p = (1 - epsilon) gets, because logical inference is so non-smooth, the win rate on yet-unseen data is fundamentally algorithmically random and uncorrelated with in-distribution performance, so it's never appropriate to say that a model can understand or reason
To me it seems that people are subject to both of these criteria, though. They have a tendency to cap out at their eventual skill cap unless given a challenge to nudge them to a higher level, and likewise possession of logical reasoning doesn't let us say much at all about situations that their reasoning is unfamiliar with.
I also think, if you want to say that what LLMs do has nothing to do with understanding or ability, then you also have to have an alternate explanation for the phenomenon of AlphaGo defeating Lee Sedol being a catalyst for top Go players being able to rapidly increase their own rankings shortly after.
Maybe good chess, but not perfect chess. That would by definition be game-theoretically optimal, which in turn implies having to maintain no state other than your position in a large but precomputable game tree.
That was the point of the post (though you have to read it to the end to see this). That one model can play chess pretty well, while the free models and OpenAI's later models can't. That's weird.
Here's the opposite theory: Language encodes objective reasoning (or at least, it does some of the time). A sufficiently large ANN trained on sufficiently large amounts of text will develop internal mechanisms of reasoning that can be applied to domains outside of language.
Based on what we are currently seeing LLMs do, I'm becoming more and more convinced that this is the correct picture.
The blog post demonstrates that a LLM plays chess at a decent level.
The blog post explains why. It addresses the issue of data quality.
I don't understand what point you thought you were making. Regardless of where you stand, the blog post showcases a surprising result.
You stressed your unfounded prior belief, you were presented with data that proves it wrong, and your reaction was to post a comment with a thinly veiled accusation that people are not educated, when clearly you are the one who's off.
To make matters worse, this topic is also about curiosity, which has a strong link with intelligence and education. And you are here criticizing others on those grounds in spite of showing your deficit right in the first sentence.
This blog post was a great read. Very surprising, engaging, and thought provoking.
So in practice, your position actually includes the log of all moves to that point. That’s a lot more state than just what you can see on the board.
It’s hard to explain emergent mechanisms because of the nature of generation, which is one-pass sequential matrix reduction. I say this while waving my hands, but listen. Reasoning is similar to Turing-complete algorithms, and what LLMs can become through training is similar to limited pushdown automata at best. I think this is a good conceptual handle for it.
“Line of thought” is an interesting way to loop the process back, but it doesn’t show that much improvement, afaiu, and is still finite.
Otoh, a chess player takes as much time and “loops” as they need to get the result (ignoring competitive time limits).
“The game is not automatically drawn if a position occurs for the third time – one of the players, on their turn, must claim the draw with the arbiter. The claim must be made either before making the move which will produce the third repetition, or after the opponent has made a move producing a third repetition. By contrast, the fivefold repetition rule requires the arbiter to intervene and declare the game drawn if the same position occurs five times, needing no claim by the players.”
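For what it's worth, that distinction is also visible in the python-chess API (my example, not part of the quoted rule): the threefold case is only claimable, the fivefold case is detected automatically.

    # Sketch with python-chess, mirroring the quoted rule: threefold repetition
    # is merely claimable, fivefold repetition is an automatic draw.
    import chess

    board = chess.Board()
    for _ in range(4):                       # shuffle the knights back and forth
        for san in ["Nf3", "Nf6", "Ng1", "Ng8"]:
            board.push_san(san)

    print(board.can_claim_threefold_repetition())  # True: a player may claim the draw
    print(board.is_fivefold_repetition())          # True: the position has now occurred five times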
In this scope, my mental model is that LLMs would be good at modern style long form chess, but would likely be easy to trip up with certain types of move combinations that most humans would not normally use. My prediction is that once found they would be comically susceptible to these patterns.
Clearly, we have no real basis for saying it is "good" or "bad" at chess, and even using chess performance as a measurement sample is a highly biased decision, likely born out of marketing rather than principle.
Because it would be super cool; curiosity isn't something to be frowned upon. If it turned out it did play chess reasonably well, it would mean emergent behaviour instead of just echoing things said online.
But it's wishful thinking with this technology at its current level; as with previous generations of chatbots, they can initially convince some people that they're intelligent thinking machines, but this test proves that they aren't. It's part of the scientific process.
While it can be played as stateless, remembering previous moves gives you insight into the potential strategy that is being built.
The issue is that even that kind of obviousness is criticised here. People get mad at the idea of doing experiments when we already expect a result.
I think you're using "skill" to refer solely to one aspect of chess skill: the ability to do brute-force calculations of sequences of upcoming moves. There are other aspects of chess skill, such as:
1. The ability to judge a chess position at a glance, based on years of experience in playing chess and theoretical knowledge about chess positions.
2. The ability to instantly spot tactics in a position.
In blitz (about 5 minutes) or bullet (1 minute) chess games, these other skills are much more important than the ability to calculate deep lines. They're still aspects of chess skill, and they're probably equally important as the ability to do long brute-force calculations.
In particular, it is not an LLM and it is not trained solely on observations of chess moves.
Do we know it's not special-casing chess and instead using a different engine (not an LLM) for playing?
To be clear, this would be an entirely appropriate approach to problem-solving in the real world, it just wouldn't be the LLM that's playing chess.
That should give patterns (hence your use of the verb to "spot" them, as the grandmaster would indeed spot the patterns) recognizable in the game string.
More specifically grammar-like patterns, e.g. the same moves but translated.
Typically what an LLM can excel at.
So... unless I'm understanding something incorrectly, something like "the last three moves plus 17 bits of state" (plus the current board state) should be enough to treat chess as a memoryless process. Doesn't seem like too much to track.
Again, this isn't exactly HAL playing chess.
This means you do need to store the last 50 board positions in the worst case. Normally you need to store less because many moves are irreversible (pawns cannot go backwards, pieces cannot be un-captured).
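A concrete way to see how much extra state that is: a FEN string (shown below with python-chess, my example) already packs the castling rights, the en passant square and the halfmove clock for the fifty-move rule next to the piece placement; the repetition history is the one thing it leaves out, which is exactly why you still need the move log back to the last irreversible move.

    # Sketch with python-chess: FEN is roughly the "memoryless" state in question.
    # It encodes piece placement, side to move, castling rights, en passant square
    # and the halfmove clock -- but not how often earlier positions occurred.
    import chess

    board = chess.Board()
    for san in ["e4", "c5", "Nf3", "d6"]:
        board.push_san(san)

    print(board.fen())
    # fields: <pieces> <side to move> <castling> <en passant> <halfmove clock> <fullmove number>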
Chess engines essentially do two things: calculate the value of a given position for their side, and walk the game tree while evaluating its positions in that way.
Historically, position value was a handcrafted function using win/lose criteria (e.g. being able to give checkmate is infinitely good) and elaborate heuristics informed by real chess games, e.g. having more space on the board is good, having a high-value piece threatened by a low-value one is bad etc., and the strength of engines largely resulted from being able to "search the game tree" for good positions very broadly and deeply.
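As a toy illustration of that split (a material-only handcrafted evaluation plus a brute-force search; my sketch, vastly cruder than any real engine):

    # Toy sketch of the classic engine split: a handcrafted evaluation function
    # (material count only) plus a game-tree search (plain negamax, no pruning).
    # Uses python-chess; real engines add rich heuristics, alpha-beta pruning
    # and far deeper search.
    import chess

    PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    def evaluate(board: chess.Board) -> int:
        """Material balance from the point of view of the side to move."""
        score = 0
        for piece_type, value in PIECE_VALUES.items():
            score += value * len(board.pieces(piece_type, chess.WHITE))
            score -= value * len(board.pieces(piece_type, chess.BLACK))
        return score if board.turn == chess.WHITE else -score

    def negamax(board: chess.Board, depth: int) -> float:
        if depth == 0 or board.is_game_over():
            return evaluate(board)
        best = -float("inf")
        for move in board.legal_moves:
            board.push(move)
            best = max(best, -negamax(board, depth - 1))
            board.pop()
        return best

    def best_move(board: chess.Board, depth: int = 3) -> chess.Move:
        best_score, choice = -float("inf"), None
        for move in board.legal_moves:
            board.push(move)
            score = -negamax(board, depth - 1)
            board.pop()
            if score > best_score:
                best_score, choice = score, move
        return choice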
Recently, neural networks (trained on many simulated games) have been replacing these hand-crafted position evaluation functions, but there's still a ton of search going on. In other words, the networks are still largely "dumb but fast", and without deep search they'll lose against even a novice player.
This paper now presents a searchless chess engine, i.e. one that essentially "looks at the board once" and "intuits the best next move", without "calculating" resulting hypothetical positions at all. In the words of Capablanca, a chess world champion also cited in the paper: "I see only one move ahead, but it is always the correct one."
The fact that this is possible can be considered surprising, a testament to the power of transformers etc., but it does indeed have nothing to do with language or LLMs (other than that the best ones known to date are based on the same architecture).
And in the API, all of the common features like maths and search are just not there. You can implement them yourself.
You can compare with self hosted models like llama and the performance is quite similar.
You can also jailbreak it and get a shell in the container for some further proof.
https://github.com/adamkarvonen/chess_gpt_eval
Even the blog above says as much.
That conspiracy theory has no traction in reality. This blog post is so far the only reference to using LLMs to play chess. The "closed-source" model (whatever that is) is an older version that does worse than the newer version. If your conspiracy theory had any bearing on reality, how come this fictional "real chess engine" was only used in a single release? Unbelievable.
Back in reality, it is well known that newer models made available to the public are adapted to business needs by constraining their capabilities and limiting liability.
Here's an experiment: give an LLM a balanced middle game board position and ask it "play a new move that a creative grandmaster has discovered, never before played in chess and explain the tactics and strategy behind it". Repeat many times. Now analyse each move in an engine and look at the distribution of moves and responses. Hypothesis: It is going to come up with a bunch of moves all over the ratings map with some sound and some fallacious arguments.
I really don't think there's anything too mysterious going on here. It just synthesizes existing knowledge and gives answers that include big hits, big misses and everything in between. Creators chip away at the edges to change that distribution but the fundamental workings don't change.
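A rough harness for that experiment might look like this (the ask_llm helper is a placeholder for whatever chat API you use, the FEN is just an arbitrary balanced-looking middlegame, and a stockfish binary is assumed to be on the PATH):

    # Rough harness for the experiment above. ask_llm() is a hypothetical helper
    # standing in for whatever chat API you use; "stockfish" is assumed on PATH.
    import chess
    import chess.engine

    FEN = "r1bq1rk1/pp2bppp/2n1pn2/3p4/3P4/2NBPN2/PP3PPP/R1BQ1RK1 w - - 0 9"
    PROMPT = ("Play a new move that a creative grandmaster has discovered, never "
              "before played in chess, and explain the tactics and strategy behind it.")

    def ask_llm(fen: str, prompt: str) -> str:
        """Placeholder: return the model's suggested move in SAN, e.g. 'Re1'."""
        raise NotImplementedError

    def evaluate_suggestions(n_trials: int = 50):
        scores = []
        with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
            for _ in range(n_trials):
                board = chess.Board(FEN)
                try:
                    board.push_san(ask_llm(FEN, PROMPT))   # drop outright illegal moves
                except ValueError:
                    scores.append(None)
                    continue
                info = engine.analyse(board, chess.engine.Limit(depth=18))
                scores.append(info["score"].white().score(mate_score=10000))
        return scores   # the distribution of engine evals (None = illegal suggestion)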