688 points crescit_eundo | 4 comments
1. Havoc No.42143134
My money is on a fluke inclusion of more chess data in that model's training.

All the other models perform roughly as well as each other on other tasks, and many are architecturally similar, so training data is the most likely explanation.

replies(2): >>42143272 >>42143307
2. bhouston No.42143272
Yeah. This.
3. permo-w No.42143307
I feel like a lot of people here are slightly misunderstanding how LLM training works. Yes, the base models are trained somewhat blind on masses of text, but then they're heavily fine-tuned with custom, human-generated reinforcement learning, not just for safety but for any desired feature.

These companies do quirky one-off training experiments all the time. I would not be remotely shocked if at some point OpenAI paid some trainers to input and favour strong chess moves.
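
To make that concrete: the reward-modelling stage of RLHF is typically trained with a pairwise preference loss over human comparisons. A rough PyTorch sketch of that objective (purely illustrative, obviously not OpenAI's actual code; the chess example is hypothetical):

    import torch
    import torch.nn.functional as F

    def reward_model_loss(r_chosen, r_rejected):
        # Bradley-Terry pairwise objective: push the scalar reward of the
        # human-preferred completion above the reward of the rejected one.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Hypothetical example: a trainer marks "Nf3" as stronger than "Na3",
    # so the reward model learns r(prompt + "Nf3") > r(prompt + "Na3").
    loss = reward_model_loss(torch.tensor([1.7]), torch.tensor([0.3]))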

replies(1): >>42143375
4. simonw No.42143375
From this OpenAI paper, page 29 (https://arxiv.org/pdf/2312.09390#page=29):

"A.2 CHESS PUZZLES

Data preprocessing. The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher were included in pretraining. These games still include the moves that were played in-game, rather than the best moves in the corresponding positions. On the other hand, the chess puzzles require the model to predict the best move. We use the dataset originally introduced in Schwarzschild et al. (2021b) which is sourced from https://database.lichess.org/#puzzles (see also Schwarzschild et al., 2021a). We only evaluate the model's ability to predict the first move of the puzzle (some of the puzzles require making multiple moves). We follow the pretraining format, and convert each puzzle to a list of moves leading up to the puzzle position, as illustrated in Figure 14. We use 50k puzzles sampled randomly from the dataset as the training set for the weak models and another 50k for weak-to-strong finetuning, and evaluate on 5k puzzles. For bootstrapping (Section 4.3.1), we use a new set of 50k puzzles from the same distribution for each step of the process."
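
For anyone curious what that conversion looks like mechanically, here's a minimal sketch using python-chess and the shape of the Lichess puzzle CSV linked above (the sample row is made up, and the paper's real pipeline also reconstructs the game's preceding moves from the source game, which this skips):

    import chess

    def puzzle_to_target(fen, uci_moves):
        # Lichess puzzle convention: the FEN is the position *before* the
        # opponent's move. The first move in the Moves field is played by
        # the opponent to create the puzzle position; the second move is
        # the first move of the solution -- the one the paper evaluates.
        board = chess.Board(fen)
        moves = uci_moves.split()
        board.push(chess.Move.from_uci(moves[0]))  # opponent's move
        # Best move in the puzzle position, in the SAN notation PGN uses.
        best = board.san(chess.Move.from_uci(moves[1]))
        return board.fen(), best

    # Made-up row in the shape of the Lichess puzzle database
    # (PuzzleId,FEN,Moves,Rating,...): position after 1. e4 e5 2. Nf3 Nc6.
    position, best_move = puzzle_to_target(
        "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3",
        "f3e5 c6e5",
    )
    print(position, best_move)  # ... Nxe5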