It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.
I don't really know what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.
The problem here is the specific model architecture, training data, vocabulary/tokenization method (if you were even going to represent a game this way... which you wouldn't), loss function, and probably decoding strategy... basically everything is wrong here.
Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.
> It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.
No. You can definitely train a model to be really good at chess without "actual reasoning and deterministic computation".
This is a very vague claim, but they can reconstruct the board from the list of moves, which I would say proves this wrong.
> LLMs have limited memory
For the recent models this is not a problem for the chess example. You can feed whole books into them if you want to.
> so they struggle to remember previous moves
Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.
> They’re great at explaining chess concepts or moves but not actually competing in a match.
What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
Chess moves are simply tokens like any other. Given enough chess training data, it would make sense to have part of the network trained to handle chess specifically instead of simply encoding basic lists of moves and follow-ups. The result would be a general purpose sub-network trained on chess.
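To make "tokens like any other" concrete, here's a tiny sketch (assuming the tiktoken library and its cl100k_base encoding, which is my choice of tokenizer, not anything from the post): PGN text gets chopped into ordinary subword tokens with no chess-specific handling.

    # Sketch: PGN move text becomes ordinary subword tokens.
    # Assumes the tiktoken library and the cl100k_base encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
    token_ids = enc.encode(pgn)
    print(token_ids)                              # plain integer ids
    print([enc.decode([t]) for t in token_ids])   # how the moves get split up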
In what sense is chess stateless? Question: is Rxa6 a legal move? You need board state to refer to in order to decide.
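To make that concrete, here's a small sketch using the python-chess library (my example, not from the thread): whether "Rxa6" even parses as a legal move depends entirely on the position you have reconstructed from the move list.

    # Sketch with the python-chess library: the same SAN string, "Rxa6",
    # is legal or illegal depending entirely on the board state you hold.
    import chess

    board = chess.Board()                  # starting position
    for san in ["e4", "e5", "Nf3", "Nc6"]:
        board.push_san(san)                # rebuild the position move by move

    try:
        board.parse_san("Rxa6")            # raises if the move is not legal here
        print("Rxa6 is legal in this position")
    except ValueError:
        print("Rxa6 is illegal in this position")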
https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
It is not stateless, because good chess isn't played as a series of independent moves -- it's played as a series of moves connected to a player's strategy.
> What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
Continuing from the above, "best" in the latter sense involves understanding possible future moves after the next move.
Ergo, if I looked at all games with the current board state and chose the next move that won the most games, it'd be tactically sound but strategically ignorant.
Because in many of those games, that next move was played in support of some broader strategy.
That state belongs to the player, not to the game. You can carry your own state in any game you want - for example remember who starts with what move in rock paper scissors, but that doesn't make that game stateful. It's the player's decision (or bot's implementation) to use any extra state or not.
I wrote "previous moves" specifically (and the extra bits already addressed elsewhere), but the LLM can carry/rebuild its internal state between the steps.
So even if the rules of chess are (mostly) stateless, the resulting game itself is not.
Thus, you can't dismiss concerns about LLMs having difficulty tracking state by saying that chess is stateless. It's not, in that sense.
A friend of mine just started playing chess a few weeks ago and can beat it about 25% of the time.
It will hang pieces, and if you hang your own queen there's about a 50% chance it won't be taken.
If some mental model says that LLMs should be bad at chess, then it fails to explain why we have LLMs playing strong chess. If another mental model says the inverse, then it fails to explain why so many of these large models fail spectacularly at chess.
Clearly, there's more going on here.
I mean at some level you're saying that no matter how close to 1 the win probability (1 - epsilon) gets, both of the following are true:
A. you should always expect the computation you can do via conscious reasoning alone to be sufficient, at least in principle, to asymptotically reach a higher win probability than a model, no matter what the model's win probability was to begin with
B. no matter how close to 1 the model's win rate p = (1 - epsilon) gets, because logical inference is so non-smooth, the win rate on yet-unseen data is fundamentally algorithmically random and uncorrelated with in-distribution performance, so it's never appropriate to say that a model can understand or reason
To me it seems that people are subject to both of these criteria, though. They have a tendency to cap out at their eventual skill cap unless given a challenge to nudge them to a higher level, and likewise possession of logical reasoning doesn't let us say much at all about situations that their reasoning is unfamiliar with.
I also think, if you want to say that what LLMs do has nothing to do with understanding or ability, then you also have to have an alternate explanation for the phenomenon of AlphaGo defeating Lee Sedol being a catalyst for top Go players being able to rapidly increase their own rankings shortly after.
Maybe good chess, but not perfect chess. That would by definition be game-theoretically optimal, which in turn implies having to maintain no state other than your position in a large but precomputable game tree.
That was the point of the post (though you have to read it to the end to see this). That one model can play chess pretty well, while the free models and OpenAI's later models can't. That's weird.
Here's the opposite theory: Language encodes objective reasoning (or at least, it does some of the time). A sufficiently large ANN trained on sufficiently large amounts of text will develop internal mechanisms of reasoning that can be applied to domains outside of language.
Based on what we are currently seeing LLMs do, I'm becoming more and more convinced that this is the correct picture.
The blog post demonstrates that a LLM plays chess at a decent level.
The blog post explains why. It addresses the issue of data quality.
I don't understand what point you thought you were making. Regardless of where you stand, the blog post showcases a surprising result.
You stressed your unfounded prior belief, you were presented with data that proves it wrong, and your reaction was to post a comment with a thinly veiled accusation that people are not educated, when clearly you are the one who's off.
To make matters worse, this topic is also about curiosity, which has a strong link with intelligence and education. And you are here criticizing others on those grounds in spite of showing your deficit right in the first sentence.
This blog post was a great read. Very surprising, engaging, and thought provoking.
So in practice, your position actually includes the log of all moves to that point. That’s a lot more state than just what you can see on the board.
It’s hard to explain emergent mechanisms because of the nature of generation, which is one-pass sequential matrix reduction. I say this while waving my hands, but listen. Reasoning is similar to Turing-complete algorithms, and what LLMs can become through training is similar to limited pushdown automata at best. I think this is a good conceptual handle for it.
“Line of thought” is an interesting way to loop the process back, but it doesn’t show that much improvement, afaiu, and is still finite.
Otoh, a chess player takes as much time and “loops” as they need to get the result (ignoring competitive time limits).
“The game is not automatically drawn if a position occurs for the third time – one of the players, on their turn, must claim the draw with the arbiter. The claim must be made either before making the move which will produce the third repetition, or after the opponent has made a move producing a third repetition. By contrast, the fivefold repetition rule requires the arbiter to intervene and declare the game drawn if the same position occurs five times, needing no claim by the players.”
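For what it's worth, that distinction is also visible in the python-chess API (my example, not part of the quoted rule): the threefold case is only claimable, the fivefold case is detected automatically.

    # Sketch with python-chess, mirroring the quoted rule: threefold repetition
    # is merely claimable, fivefold repetition is an automatic draw.
    import chess

    board = chess.Board()
    for _ in range(4):                       # shuffle the knights back and forth
        for san in ["Nf3", "Nf6", "Ng1", "Ng8"]:
            board.push_san(san)

    print(board.can_claim_threefold_repetition())  # True: a player may claim the draw
    print(board.is_fivefold_repetition())          # True: the position has now occurred five times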
In this scope, my mental model is that LLMs would be good at modern style long form chess, but would likely be easy to trip up with certain types of move combinations that most humans would not normally use. My prediction is that once found they would be comically susceptible to these patterns.
Clearly, we have no real basis for saying it is "good" or "bad" at chess, and even using chess performance as a measurement sample is a highly biased decision, likely born out of marketing rather than principle.
Because it would be super cool; curiosity isn't something to be frowned upon. If it turned out it did play chess reasonably well, it would mean emergent behaviour instead of just echoing things said online.
But it's wishful thinking with this technology at its current level; as with previous generations of chatbots, they can initially convince some people that they're intelligent thinking machines, but this test proves that they aren't. It's part of the scientific process.
While it can be played as stateless, remembering previous moves gives you insight into the potential strategy that is being built.
The issue is that even that kind of obviousness is criticised here. People get mad at the idea of doing experiments when we already expect a result.
I think you're using "skill" to refer solely to one aspect of chess skill: the ability to do brute-force calculations of sequences of upcoming moves. There are other aspects of chess skill, such as:
1. The ability to judge a chess position at a glance, based on years of experience in playing chess and theoretical knowledge about chess positions.
2. The ability to instantly spot tactics in a position.
In blitz (about 5 minutes) or bullet (1 minute) chess games, these other skills are much more important than the ability to calculate deep lines. They're still aspects of chess skill, and they're probably equally important as the ability to do long brute-force calculations.
In particular, it is not an LLM and it is not trained solely on observations of chess moves.
Do we know it's not special-casing chess and instead using a different engine (not an LLM) for playing?
To be clear, this would be an entirely appropriate approach to problem-solving in the real world, it just wouldn't be the LLM that's playing chess.
That should give patterns (hence your use of the verb to "spot" them, as the grandmaster would indeed spot the patterns) recognizable in the game string.
More specifically grammar-like patterns, e.g. the same moves but translated.
Typically what an LLM can excel at.
So... unless I'm understanding something incorrectly, something like "the last three moves plus 17 bits of state" (plus the current board state) should be enough to treat chess as a memoryless process. Doesn't seem like too much to track.
Again, this isn't exactly HAL playing chess.
This means you do need to store the last 50 board positions in the worst case. Normally you need to store less because many moves are irreversible (pawns cannot go backwards, pieces cannot be un-captured).
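A concrete way to see how much extra state that is: a FEN string (shown below with python-chess, my example) already packs the castling rights, the en passant square and the halfmove clock for the fifty-move rule next to the piece placement; the repetition history is the one thing it leaves out, which is exactly why you still need the move log back to the last irreversible move.

    # Sketch with python-chess: FEN is roughly the "memoryless" state in question.
    # It encodes piece placement, side to move, castling rights, en passant square
    # and the halfmove clock -- but not how often earlier positions occurred.
    import chess

    board = chess.Board()
    for san in ["e4", "c5", "Nf3", "d6"]:
        board.push_san(san)

    print(board.fen())
    # fields: <pieces> <side to move> <castling> <en passant> <halfmove clock> <fullmove number>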
Chess engines essentially do two things: calculate the value of a given position for their side, and walk the game tree while evaluating its positions in that way.
Historically, position value was a handcrafted function using win/lose criteria (e.g. being able to give checkmate is infinitely good) and elaborate heuristics informed by real chess games, e.g. having more space on the board is good, having a high-value piece threatened by a low-value one is bad etc., and the strength of engines largely resulted from being able to "search the game tree" for good positions very broadly and deeply.
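As a toy illustration of that split (a material-only handcrafted evaluation plus a brute-force search; my sketch, vastly cruder than any real engine):

    # Toy sketch of the classic engine split: a handcrafted evaluation function
    # (material count only) plus a game-tree search (plain negamax, no pruning).
    # Uses python-chess; real engines add rich heuristics, alpha-beta pruning
    # and far deeper search.
    import chess

    PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    def evaluate(board: chess.Board) -> int:
        """Material balance from the point of view of the side to move."""
        score = 0
        for piece_type, value in PIECE_VALUES.items():
            score += value * len(board.pieces(piece_type, chess.WHITE))
            score -= value * len(board.pieces(piece_type, chess.BLACK))
        return score if board.turn == chess.WHITE else -score

    def negamax(board: chess.Board, depth: int) -> float:
        if depth == 0 or board.is_game_over():
            return evaluate(board)
        best = -float("inf")
        for move in board.legal_moves:
            board.push(move)
            best = max(best, -negamax(board, depth - 1))
            board.pop()
        return best

    def best_move(board: chess.Board, depth: int = 3) -> chess.Move:
        best_score, choice = -float("inf"), None
        for move in board.legal_moves:
            board.push(move)
            score = -negamax(board, depth - 1)
            board.pop()
            if score > best_score:
                best_score, choice = score, move
        return choice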
Recently, neural networks (trained on many simulated games) have been replacing these hand-crafted position evaluation functions, but there's still a ton of search going on. In other words, the networks are still largely "dumb but fast", and without deep search they'll lose against even a novice player.
This paper now presents a searchless chess engine, i.e. one that essentially "looks at the board once" and "intuits the best next move", without "calculating" resulting hypothetical positions at all. In the words of Capablanca, a chess world champion also cited in the paper: "I see only one move ahead, but it is always the correct one."
The fact that this is possible can be considered surprising, a testament to the power of transformers etc., but it does indeed have nothing to do with language or LLMs (other than that the best ones known to date are based on the same architecture).
And in the API, all of the common features like maths and search are just not there. You can implement them yourself.
You can compare with self hosted models like llama and the performance is quite similar.
You can also jailbreak it and get a shell in the container for some further proof.
https://github.com/adamkarvonen/chess_gpt_eval
Even the blog above says as much.
That conspiracy theory has no traction in reality. This blog post is so far the only reference to using LLMs to play chess. The "closed-source" model (whatever that is) is an older version that does worse than the newer version. If your conspiracy theory had any bearing on reality, how come this fictional "real chess engine" was only used in a single release? Unbelievable.
Back in reality, it is well known that newer models made available to the public are adapted to business needs by constraining their capabilities and limiting liability.
Here's an experiment: give an LLM a balanced middle game board position and ask it "play a new move that a creative grandmaster has discovered, never before played in chess and explain the tactics and strategy behind it". Repeat many times. Now analyse each move in an engine and look at the distribution of moves and responses. Hypothesis: It is going to come up with a bunch of moves all over the ratings map with some sound and some fallacious arguments.
I really don't think there's anything too mysterious going on here. It just synthesizes existing knowledge and gives answers that include big hits, big misses and everything in between. Creators chip away at the edges to change that distribution but the fundamental workings don't change.
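A rough harness for that experiment might look like this (the ask_llm helper is a placeholder for whatever chat API you use, the FEN is just an arbitrary balanced-looking middlegame, and a stockfish binary is assumed to be on the PATH):

    # Rough harness for the experiment above. ask_llm() is a hypothetical helper
    # standing in for whatever chat API you use; "stockfish" is assumed on PATH.
    import chess
    import chess.engine

    FEN = "r1bq1rk1/pp2bppp/2n1pn2/3p4/3P4/2NBPN2/PP3PPP/R1BQ1RK1 w - - 0 9"
    PROMPT = ("Play a new move that a creative grandmaster has discovered, never "
              "before played in chess, and explain the tactics and strategy behind it.")

    def ask_llm(fen: str, prompt: str) -> str:
        """Placeholder: return the model's suggested move in SAN, e.g. 'Re1'."""
        raise NotImplementedError

    def evaluate_suggestions(n_trials: int = 50):
        scores = []
        with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
            for _ in range(n_trials):
                board = chess.Board(FEN)
                try:
                    board.push_san(ask_llm(FEN, PROMPT))   # drop outright illegal moves
                except ValueError:
                    scores.append(None)
                    continue
                info = engine.analyse(board, chess.engine.Limit(depth=18))
                scores.append(info["score"].white().score(mate_score=10000))
        return scores   # the distribution of engine evals (None = illegal suggestion)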