Even though I'm sure chess games were included in some of the LLMs' training data, I'd bet a model trained just for chess would do far better.
1. The author mentioned that tokenization causes something minuscule, like a " " at the end of the input, to shatter the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?
2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one or more models would perform significantly better with higher or lower temperatures?
Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...
The fact that the one closed source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count to 10,000 (something that most LLMs can't do for known reasons) you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks) - the author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.
I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least try to falsify the hypothesis that many of these huge issues are due to tokenization problems, but... yeah.
Surprised I don't see more research into radically different tokenization.
OpenAI's tokenizer turns "chess" into "ch" and "ess". We could just split it into "c" "h" "e" "s" "s".
That is, the groups are encoding something the model doesn't have to learn.
This is not far from the "sight words" we teach kids.
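For anyone curious, the splits are easy to inspect with OpenAI's tiktoken library; the exact token ids and boundaries depend on the encoding, so the printed output is only illustrative:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["chess", " chess", "1. e4 e5 2."]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{text!r} -> {pieces}")
    # Character-level tokenization would instead give ['c', 'h', 'e', 's', 's'],
    # at the cost of much longer sequences for the same text.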
This is incorrect. They get translated into the shared latent space, but they're not tokenized in any way resembling the text part.
That said, for the sake of completeness, modern chess engines (with high-quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. If the opponent makes one mistake, even a very small one, they will lose.
While writing this I absently wondered whether, if you increased the skill level of Stockfish, maybe to maximum, or at least to that of an 1800+ Elo player, you would see more successful games. Even then, it would only be because the "narrower training data" at that level (i.e. advanced players won't play trash moves) will probably get you more wins in your graph, but it won't indicate any better play; it will just be a reflection of less noise: fewer, more reinforced known positions.
Or whatever you deem a good step-by-step description of what an actually good beginner chess player might do.
Then try different notations, different prompt variations, temperatures and the other parameters. That all needs to go into your hyperparameter tuning.
One could try using DSPy for automatic prompt optimization.
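Before reaching for DSPy, even a crude manual sweep shows how sensitive the results are. A minimal sketch assuming the OpenAI 1.x SDK and the legacy completions endpoint; the prompt templates and parameter grid below are made up for illustration, not taken from the article:

    from itertools import product
    from openai import OpenAI

    client = OpenAI()

    PROMPTS = [
        "{pgn}",                                   # bare PGN continuation
        "You are a strong club player.\n\n{pgn}",  # persona prefix
    ]
    TEMPERATURES = [0.0, 0.3, 0.7]

    def next_move(prompt: str, temperature: float) -> str:
        resp = client.completions.create(
            model="gpt-3.5-turbo-instruct",
            prompt=prompt,
            max_tokens=8,
            temperature=temperature,
        )
        return resp.choices[0].text.strip()

    for template, temp in product(PROMPTS, TEMPERATURES):
        move = next_move(template.format(pgn="1. e4 e5 2."), temp)
        print(f"temp={temp} template={template[:25]!r} -> {move!r}")

Each (prompt, temperature) cell would then need to be scored over enough full games for the comparison to mean anything.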
- In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.
- A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.
- Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.
- Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!
- Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPT-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.
- Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.
I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."
1. That would just be plain bizarre
2. It plays like what you'd expect from an LLM that could play chess. That is, the level of play can be modulated by the prompt and doesn't manifest the same way shifting the level of Stockfish etc. does. Also, the specific chess notation being prompted actually matters.
3. It's sensitive to how the position came to be. Clearly not an existing chess engine. https://github.com/dpaleka/llm-chess-proofgame
4. It does make illegal moves. It's rare (~5 in 8205) but it happens. https://github.com/adamkarvonen/chess_gpt_eval
5. You can, or at least you used to be able to, inspect the logprobs. I think OpenAI has stopped exposing this, but the link in 4 does show the author inspecting them for Turbo instruct.
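For reference, this is roughly what inspecting logprobs looks like against the legacy completions endpoint with the OpenAI 1.x SDK; whether logprobs are still returned for this model may have changed, so treat it as a sketch:

    from openai import OpenAI

    client = OpenAI()
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt="1. e4 e5 2. Nf3 Nc6 3.",
        max_tokens=4,
        temperature=0,
        logprobs=5,  # top-5 alternatives per generated token
    )
    print(resp.choices[0].text)
    print(resp.choices[0].logprobs.top_logprobs)  # candidate continuations and their logprobs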
Indeed. As has been pointed out before, the number of possible chess games vastly dwarfs even the wildest estimate of the number of atoms in the known universe.
E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be accurately solved if they go step-by-step. Without this assistance the LLM is likely to simply guess.
I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real - something high-ranked chess players can do without much thought (telling a random arrangement of pieces from a position that arose in an actual game).
Yeah, once you've deviated from a sequence you're lost.
Maybe approaching it by learning the best move in billions/trillions of positions, and feeding that into some AI could work better. Similar positions often have the same kind of best move.
Couldn't this be evidence that it is using an engine? Maybe if you use the wrong notation it relies on the ANN rather than calling to the engine.
Likewise:
- The sensitivity to game history is interesting, but is it actually true that other chess engines only look at current board state? Regardless, maybe it's not an existing chess engine! I would think OpenAI has some custom chess engine built as a side project, PoC, etc. In particular this engine might be neural and trained on actual games rather than board positions, which could explain dependency on past moves. Note that the engine is not actually very good. Does AlphaZero depend on move history? (Genuine question, I am not sure. But it does seem likely.)
- I think the illegal moves can be explained similarly to why gpt-o1 sometimes screws up easy computations despite having access to Python: an LLM having access to a tool does not guarantee it always uses that tool.
I realize there are holes in the argument, but I genuinely don't think these holes are as big as the "why is gpt-3.5-turbo-instruct so much better at chess than GPT-4?" hole.
If you share every prior, and aren't particularly concerned with being disciplined in treating conversation as proposing a logical argument (I'm not myself, people find it off-putting), it probably wouldn't seem at all convoluted.
* Layer chess into gpt-3.5-instruct only, but not ChatGPT, not GPT-4, to defeat the naysayers when GPT-4 comes out? *shrugs* If the issues with that are unclear, I can lay it out more
** FWIW, at the time, pre-ChatGPT, before the hype, there wasn't a huge focus on chess, nor a ton of naysayers to defeat. It would have been bizarre to put this much energy into it, modulo the scatter-brained thinking in *
Separately, if you are able to show OpenAI is serving pre-canned responses in some instances, instead of running inference, you will get a ton of attention if you write it up.
I'm not saying this in an aggro tone, it's a genuinely interesting subject to me because I wrote off LLMs at first because I thought this was going on.* Then I spent the last couple years laughing at myself for thinking that they would do that. Would be some mix of fascinated and horrified to see it come full circle.
* I can't remember what, exactly; it was as far back as 2018. But someone argued that OpenAI was patching in individual answers because scaling was dead and they had no answers, way, way before ChatGPT.
There is no advantage to tokenization itself; it just helps work around limitations in context windows and training.
It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.
I am very surprised by the performance of gpt-3.5-turbo-instruct. Beating Stockfish? I will have to run the experiment with that model to check that out.
"Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"
https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
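For anyone who wants to rerun it, here is a rough sketch of the harness, assuming a local Stockfish binary, python-chess, and the OpenAI 1.x SDK; the exact prompt, Stockfish settings, and retry logic in the article differ, so this is only an approximation:

    import chess
    import chess.engine
    from openai import OpenAI

    client = OpenAI()
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    engine.configure({"Skill Level": 0})  # stand-in for the article's weakest setting

    board = chess.Board()
    pgn = ""  # running move list, e.g. "1. e4 e5 "

    while not board.is_game_over():
        if board.turn == chess.WHITE:  # the LLM plays White
            prompt = pgn + f"{board.fullmove_number}."
            text = client.completions.create(
                model="gpt-3.5-turbo-instruct",
                prompt=prompt, max_tokens=8, temperature=0,
            ).choices[0].text
            san = text.strip().split()[0] if text.strip() else "?"
            try:
                move = board.parse_san(san)
            except ValueError:
                break  # illegal move; the article retries and then picks randomly
            pgn = prompt + f" {san} "
        else:  # Stockfish plays Black
            move = engine.play(board, chess.engine.Limit(nodes=1000)).move
            pgn += board.san(move) + " "
        board.push(move)

    print(board.result())
    engine.quit()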
Do these models actually think about a board? Chess engines do, as much as we can say that any machine thinks. But do LLMs?
I don't know really what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.
OpenAI also seem to augment the LLM with some type of VM or a Python interpreter. Maybe they run a simple chess engine such as Sunfish [1] which is around 1900-2000 ELO [2]?
I think I said that I wanted to play with new rules, where a queen could jump over any pawn, and it let me make that rule change -- and we played with this new rule. Unfortunately, I was trying to play in my head and I got mixed up and ended up losing my queen. Then I changed the rule one more time -- if you take the queen you lose -- so I won!
All the other models do vaguely similarly well in other tasks and are in many cases architecturally similar, so training data is the most likely explanation.
The problem here is the specific model architecture, training data, vocabulary/tokenization method (if you were going to even represent a game this way... which you wouldn't), loss function and probably decoding strategy.... basically everything is wrong here.
Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.
OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.
So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?
For the record, I don't actually believe this. But given the data it's a logical possibility.
If you tested it on an equally strategic but less popular game I highly doubt you would see the same performance.
A test would be to measure its performance against more difficult versions of Stockfish. A real chess engine would have a higher ceiling.
Much more likely is this model was trained on more chess PGNs. You can call that a “neural engine” if you’d like but it is the simplest solution and explains the mistakes it is making.
Game state isn’t just what you can see on the board. It includes the 50 move rule and castling rights. Those were encoded as layers in AlphaZero along with prior positions of pieces. (8 prior positions if I’m remembering correctly.)
> It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.
No. You can definitely train a model to be really good at chess without "actual reasoning and deterministic computation".
I think the author was comparing against Stockfish at a lower skill level (roughly, the number of nodes explored in a move).
Wildly inefficient? Probably. Could it maybe generate some Python to make it more efficient? Maybe, yeah.
Essentially the user would have to teach GPT to play chess, or training would have to fine-tune chess ability towards these CoT prompts, fine-tuning, etc...
These companies do quirky one-off training experiments all the time. I would not be remotely shocked if at some point OpenAI paid some trainers to input and favour strong chess moves.
This is a very vague claim, but they can reconstruct the board from the list of moves, which I would say proves this wrong.
> LLMs have limited memory
For the recent models this is not a problem for the chess example. You can feed whole books into them if you want to.
> so they struggle to remember previous moves
Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.
> They’re great at explaining chess concepts or moves but not actually competing in a match.
What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
"A.2 CHESS PUZZLES
Data preprocessing. The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher were included in pretraining. These games still include the moves that were played in-game, rather than the best moves in the corresponding positions. On the other hand, the chess puzzles require the model to predict the best move. We use the dataset originally introduced in Schwarzschild et al. (2021b) which is sourced from https://database.lichess.org/#puzzles (see also Schwarzschild et al., 2021a). We only evaluate the model's ability to predict the first move of the puzzle (some of the puzzles require making multiple moves). We follow the pretraining format, and convert each puzzle to a list of moves leading up to the puzzle position, as illustrated in Figure 14. We use 50k puzzles sampled randomly from the dataset as the training set for the weak models and another 50k for weak-to-strong finetuning, and evaluate on 5k puzzles. For bootstrapping (Section 4.3.1), we use a new set of 50k puzzles from the same distribution for each step of the process."
Chess moves are simply tokens as any other. Given enough chess training data, it would make sense to have part of the network trained to handle chess specifically instead of simply encoding basic lists of moves and follow-ups. The result would be a general purpose sub-network trained on chess.
ChatGPT doesn't see "strawberry", it sees [302, 1618, 19772]
Or maybe it's able to recognise the chess game, then get moves from an external chess game API?
In what sense is chess stateless? Question: is Rxa6 a legal move? You need board state to refer to in order to decide.
https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
It is not stateless, because good chess isn't played as a series of independent moves -- it's played as a series of moves connected to a player's strategy.
> What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
Continuing from the above, "best" in the latter sense involves understanding possible future moves after the next move.
Ergo, if I looked at all games with the current board state and chose the next move that won the most games, it'd be tactically sound but strategically ignorant.
Because many of those next moves were making that next move in support of some broader strategy.
(And it's honestly quite impressive that LLMs can play it at all, but not at all surprising that it loses pretty handily to something which is explicitly designed to search, as opposed to simply feed-forward a decision)
That state belongs to the player, not to the game. You can carry your own state in any game you want - for example remember who starts with what move in rock paper scissors, but that doesn't make that game stateful. It's the player's decision (or bot's implementation) to use any extra state or not.
I wrote "previous moves" specifically (and the extra bits already addressed elsewhere), but the LLM can carry/rebuild its internal state between the steps.
So even if the rules of chess are (mostly) stateless, the resulting game itself is not.
Thus, you can't dismiss concerns about LLMs having difficulty tracking state by saying that chess is stateless. It's not, in that sense.
My assumption is that these large companies wouldn't pay livable wages to hundreds of thousands of RLHF'ers through dozens of third-party companies if tokenization errors were just that.
A friend of mine just started playing chess a few weeks ago and can beat it about 25% of the time.
It will hang pieces, and you can hang your own queen and there's about a 50% chance it won't be taken.
If some mental model says that LLMs should be bad at chess, then it fails to explain why we have LLMs playing strong chess. If another mental model says the inverse, then it fails to explain why so many of these large models fail spectacularly at chess.
Clearly, there's more going on here.
Same surprising conclusion: gpt-3.5-turbo-instruct is much better at chess.
I mean at some level you're saying that no matter how close to 1 the win probability (1 - epsilon) gets, both of the following are true:
A. you should expect the computation that you're able to do via conscious reasoning alone to always be sufficient, at least in principle, to asymptotically get a higher win probability than the model, no matter what the model's win probability was to begin with
B. no matter how close to 1 that the model's win rate p=(1 - epsilon) gets, because logical inference is so non-smooth, the win rate on yet-unseen data is fundamentally algorithmically random/totally uncorrelated to in-distribution performance, so it's never appropriate to say that a model can understand or to reason
To me it seems that people are subject to both of these criteria, though. They have a tendency to cap out at their eventual skill cap unless given a challenge to nudge them to a higher level, and likewise possession of logical reasoning doesn't let us say much at all about situations that their reasoning is unfamiliar with.
I also think, if you want to say that what LLMs do has nothing to do with understanding or ability, then you also have to have an alternate explanation for the phenomenon of AlphaGo defeating Lee Sedol being a catalyst for top Go players being able to rapidly increase their own rankings shortly after.
It could be that the model that does chess well just happens to have the right 'connectome' purely by accident of how the various back-propagations worked out to land on various local maxima (model weights) during training. It might even be (probably is) a non-verbal connectome that's just purely logic rules, having nothing to do with language at all, but a semantic space pattern that got landed on accidentally, which can solve this class of problem.
Reminds me of how Daniel Tammet just visually "sees" answers to math problems in his mind without even knowing how they appear. It's like he sees a virtual screen with a representation akin to numbers (the answer) just sitting there to be read out from his visual cortex. He's not 'working out' the solutions. They're just handed to him purely by some connectome effects going on in the background.
Maybe good chess, but not perfect chess. That would by definition be game-theoretically optimal, which in turn implies having to maintain no state other than your position in a large but precomputable game tree.
Also, it makes a lot of sense if you expect people to play chess against the LLM, especially if you are later training future models on the chats.
When ChatGPT3.5 first came out, people were using it to simulate entire Linux system installs, and even browsing a simulated Internet.
Cool use cases like that aren't even discussed anymore.
I still wonder what sort of magic OpenAI had and then locked up away from the world in the name of cost savings.
Same thing with GPT 4 vs 4o, 4o is obviously worse in some ways, but after the initial release (when a bunch of people mentioned this), the issue has just been collectively ignored.
That was the point of the post (though you have to read it to the end to see this). That one model can play chess pretty well, while the free models and OpenAI's later models can't. That's weird.
Here's the opposite theory: Language encodes objective reasoning (or at least, it does some of the time). A sufficiently large ANN trained on sufficiently large amounts of text will develop internal mechanisms of reasoning that can be applied to domains outside of language.
Based on what we are currently seeing LLMs do, I'm becoming more and more convinced that this is the correct picture.
The discussion is about general intelligence. The model isn't able to do a task that it can do simply because it chooses the wrong strategy; that is a problem of lack of generalization, not a problem of tokenization. Being able to choose the right strategy is core to general intelligence; altering the input data to make it easier for the model to find the right solution to specific questions does not help it become more general, you just shift which narrow problems it is good at.
Karpathy agrees with you; here he is hating on tokenizers while rebuilding them for two hours.
The blog post demonstrates that an LLM plays chess at a decent level.
The blog post explains why. It addresses the issue of data quality.
I don't understand what point you thought you were making. Regardless of where you stand, the blog post showcases a surprising result.
You stressed your unfounded prior belief, you were presented with data that proves it wrong, and your reaction was to post a comment with a thinly veiled accusation of people not being educated, when clearly you are the one that's off.
To make matters worse, this topic is also about curiosity, which has a strong link with intelligence and education. And you are here criticizing others on those grounds in spite of showing your deficit right in the first sentence.
This blog post was a great read. Very surprising, engaging, and thought provoking.
So in practice, your position actually includes the log of all moves to that point. That’s a lot more state than just what you can see on the board.
It's just a lossy compression of all of the parameters, probably not important, right?
It’s hard to explain emerging mechanisms because of the nature of generation, which is one-pass sequential matrix reduction. I say this while waving my hands, but listen. Reasoning is similar to Turing complete algorithms, and what LLMs can become through training is similar to limited pushdown automata at best. I think this is a good conceptual handle for it.
“Line of thought” is an interesting way to loop the process back, but it doesn’t show that much improvement, afaiu, and still is finite.
Otoh, a chess player takes as much time and “loops” as they need to get the result (ignoring competitive time limits).
“The game is not automatically drawn if a position occurs for the third time – one of the players, on their turn, must claim the draw with the arbiter. The claim must be made either before making the move which will produce the third repetition, or after the opponent has made a move producing a third repetition. By contrast, the fivefold repetition rule requires the arbiter to intervene and declare the game drawn if the same position occurs five times, needing no claim by the players.”
You say it makes sense but how does it make sense for OpenAI to add overhead to all of its API calls for the super niche case of people playing 1800 ELO chess/chat bots? (that often play illegal moves, you can go try it yourself)
In this scope, my mental model is that LLMs would be good at modern style long form chess, but would likely be easy to trip up with certain types of move combinations that most humans would not normally use. My prediction is that once found they would be comically susceptible to these patterns.
Clearly, we have no real basis for saying it is "good" or "bad" at chess, and even using chess performance as a measurement sample is a highly biased decision, likely born out of marketing rather than principle.
Algebraic notation is completely straightforward.
It's not playing against a GM, the prompt just phrases it this way. I couldn't pinpoint the exact ELO of "lowest" stockfish settings, but it should be roughly between 1000 and 1400, which is far from professional play.
Because it would be super cool; curiosity isn't something to be frowned upon. If it turned out it did play chess reasonably well, it would mean emergent behaviour instead of just echoing things said online.
But it's wishful thinking with this technology at this current level; like previous instances of chatbots and the like, while initially they can convince some people that they're intelligent thinking machines, this test proves that they aren't. It's part of the scientific process.
First, tokenization: the tokenization of 1229 is not guaranteed to be [1,2,2,9] but it could very well be [12,29] and the "+1" operation could easily generate tokens [123,0] depending on frequencies in your corpus. This constant shifting in tokens makes it really hard to learn rules for "+1" ([9,9] +1 is not [9,10]). This is also why LLMs tend to fail at tasks like "how many letters does this word have?": https://news.ycombinator.com/item?id=41058318
Second, you need your network to understand that "+1" is worth learning. Writing "+1" as a combination of sigmoid, products and additions over normalized floating point values (hello loss of precision) is not trivial without degrading a chunk of your network, and what for? After all, math is not in the domain of language and, since we're not training an LMM here, your loss function may miss it entirely.
And finally there's statistics: the three-legged-dog problem is figuring out that a dog has four legs from corpora when no one ever writes "the four-legged dog" because it's obvious, but every reference to an unusual dog will include said description. So if people write "1+1 equals 3" satirically then your network may pick that up as fact. And how often has your network seen the result of "6372 + 1"?
But you don't have to take my word for it - take an open LLM and ask it to generate integers between 7824 and 9954. I'm not optimistic that it will make it through without hallucinations.
Chess-GPT's Internal World Model https://adamkarvonen.github.io/machine_learning/2024/01/03/c... discussed here https://news.ycombinator.com/item?id=38893456
While it can be played as stateless, remembering previous moves gives you insight into the potential strategy that is being built.
The issue is that even that kind of obviousness is criticised here. People get mad at the idea of doing experiments when we already expect a result.
Unlike people, how you ask the question really really affects the output quality.
I think you're using "skill" to refer solely to one aspect of chess skill: the ability to do brute-force calculations of sequences of upcoming moves. There are other aspects of chess skill, such as:
1. The ability to judge a chess position at a glance, based on years of experience in playing chess and theoretical knowledge about chess positions.
2. The ability to instantly spot tactics in a position.
In blitz (about 5 minutes) or bullet (1 minute) chess games, these other skills are much more important than the ability to calculate deep lines. They're still aspects of chess skill, and they're probably equally important as the ability to do long brute-force calculations.
Source: I'm at OpenAI and I was one of the first people to ever play chess against the GPT-4 base model. You may or may not trust OpenAI, but we're just a group of people trying earnestly to build cool stuff. I've never seen any inkling of an attempt to cheat evals or cheat customers.
I was one of the first people to play chess against the base GPT-4 model, and it blew my mind by how well it played. What many people don't realize is that chess performance is extremely sensitive to prompting. The reason gpt-3.5-turbo-instruct does so well is that it can be prompted to complete PGNs. All the other models use the chat format. This explains pretty much everything in the blog post. If you fine-tune a chat model, you can pretty easily recover the performance seen in 3.5-turbo-instruct.
There's nothing shady going on, I promise.
I keep thinking that if we can turn images into tokens, and we can turn audio into tokens, then surely we can create a set of tokens where the tokens are the model's own chosen representation for semantic (multimodal) meaning, and then decode those tokens back to text[1]. Obviously a big downside would be that the model can no longer 1:1 quote all text it's seen since the encoded tokens would need to be decoded back to text (which would be lossy).
[1] From what I could gather, this is exactly what OpenAI did with images in their gpt-4o report, check out "Explorations of capabilities": https://openai.com/index/hello-gpt-4o/
That's the problem with closed models, we can never know what they're doing.
For context this [1] is the board position the model is being prompted on.
There may be more than one weird thing about this experiment, for example giving instructions to the non-instruction tuned variants may be counter productive.
More importantly, let's say you just give the model the truncated PGN: does this look like a position where white is a grandmaster-level player? I don't think so. Even if the model understood chess really well, it's going to try to predict the most probable move given the position at hand. If the model thinks that white is a bad player, and the model is good at understanding chess, it's going to predict bad moves as the more likely ones, because that would better predict what is most likely to happen here.
LLMs don’t do reasoning or exploration; they write text based on previous text. So to us it may seem like playing, but it is really smart guesswork based on previous games. It’s like Kasparov writing moves without imagining the actual placement.
What would be interesting is to see whether a model, given only the rules, will play. I bet it won’t.
At this moment it’s replaying from memory but definitely not chasing goals. There’s no such thing as forward attention yet, and beam search is expensive enough, so one would prefer to actually fall back to classic chess algos.
The only way it could be true is if that model recognized and replayed the answer to the game from memory.
* What if you allow the model to do Chain of Thought (explicitly disallowed in this experiment)
* What if you explain the board position at each step to the model in the prompt, so it doesn't have to calculate/estimate it internally.
To be clear, I'm not saying that the theory is true, just that I could believe something like that could happen.
The responses vary with the user’s chess level; some find the feedback useful, while others do not. To address this, I’ve integrated a like, dislike, and request new feedback feature into the app, allowing users to actively seek better feedback.
Btw, different from OP's setup, I opted to input the FEN of the current board and the subsequent move in standard algebraic notation to request feedback, as I found these inputs to be clearer for the LLM compared to giving the PGN of the game.
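Something like the following, using python-chess, shows that prompt shape; the feedback wording is made up for illustration, not the app's actual prompt:

    import chess

    board = chess.Board()
    board.push_san("e4")
    board.push_san("e5")

    candidate = "Nf3"
    prompt = (
        f"Position (FEN): {board.fen()}\n"
        f"Move played: {candidate}\n"
        "Give brief feedback on this move for a club-level player."
    )
    print(prompt)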
AI Chess GPT https://apps.apple.com/tr/app/ai-chess-gpt/id6476107978 https://play.google.com/store/apps/details?id=net.padma.app....
Thanks
Okay, so "Excellent" still means probably quite bad. I assume at the top difficult setting gpt-3.5-turbo-instruct will still lose badly.
OpenAI has never done anything except conversational agents.
That is to say, we can literally say anything because this is very shadowy/murky, but since everything is likely a question of money... it should, _probably_, not be very far from the truth...
Anyway humans have to tokenize, too. We don't perceive the world as a continuous blob either.
In particular, it is not an LLM and it is not trained solely on observations of chess moves.
Benchmark optimizations are slightly embarrassing at worst, and an "optimization for a specific use case" at best. There's no regulation against optimizing for a particular task, everyone does it all the time, in some cases it's just not communicated transparently.
Phone manufacturers were caught "optimizing" for benchmarks again and again, removing power limits to boost scores. Hard to name an example without searching the net because it's at most a faux pas.
It’s as if OpenAI detects the IP address from a benchmark organization, and then used a completely different model.
BTW, a year ago when I used FEN for chess playing, LLMs would very quickly/often make illegal moves. (The article prompts me to check has that changed...)
If I tell an "agent", whether human or artificial, to win at chess, it is a good decision for that agent to decide to delegate that task to a system that is good at chess. This would be obvious to a human agent, so presumably it should be obvious to an AI as well.
This isn't useful for AI researchers, I suppose, but it's more useful as a tool.
(This may all be a good thing, as giving AIs true agency seems scary.)
Not quite. They programmed their "prod" software to recognise the circumstances of a laboratory test and behave differently. Namely during laboratory emissions testing they would activate emission control features they would not activate otherwise.
The software was the same they flash on production cars. They were production cars. You could take a random car from a random dealership and it would have done the same trickery in the lab.
As long as you are training it to make a tool call, you can add and remove anything you want behind the inference endpoint accessible to the public, and then you can plug the answer back into the chat ai, pass it through a moderation filter, and you might get good output from it with very little latency added.
Also, legality isn't the only difference in the VW case. With VW, they had a "good emissions" mode. They enabled the good emissions mode during the test, but disabled it during regular driving. It would have worked during regular driving, but they disabled it during regular driving. With compilers, there's no "good performance" mode that would work during regular usage that they're disabling during regular usage.
“In the summer of 2018, simply training OpenAI's Dota 2 bots required renting 128,000 CPUs and 256 GPUs from Google for multiple weeks.”
It sounds like a mismatch of definition, but I doubt you're ambivalent about a behavior right until the moment it becomes illegal, after which you think it unethical. Law is the codification and enforcement of a social contract, not the creation of it.
VW cheated such that the low emissions were only active during the test- it’s not that it was optimized for low emissions under the conditions they test for, but that you could not get those low emissions under any conditions in the real world. That's "cheating on the test" not "studying for the test."
Either way it's worth repeating the experiment imo, tweaking some of these variables (prompt guidance, stockfish strength, starting position, the name of the supposed players, etc.).
[1]: https://www.365chess.com/search_result.php?search=1&p=1&m=8&...
Key differences:
1. Intent and harm: VW’s actions directly violated laws and had environmental and health consequences. Optimizing LLMs for chess benchmarks, while arguably misleading, doesn’t have immediate real-world harms.
2. Scope: Chess-specific optimization is generally a transparent choice within AI research. It’s not a hidden “defeat device” but rather an explicit design goal.
3. Broader impact: LLMs fine-tuned for benchmarks often still retain general-purpose capabilities. They aren’t necessarily “broken” outside chess, whereas VW cars fundamentally failed to meet emissions standards.
People are alleging that OpenAI is calling out to a chess engine, but seem to be not considering this less scandalous possibility.
Of course, to the extent people are touting chess performance as evidence of general reasoning capabilities, OpenAI taking costly actions to boost specifically chess performance and not being transparent about it is still frustrating and, imo, dishonest.
These days most compilers are trying to make the general case of code fast and they rarely look for benchmarks. I won't say they never do this - just that it is much less common - if only because magazine reviews/benchmarks are not nearly as important as they used to be and so the incentive is gone.
None of these changes are explained to the LLM, so if it can tell it's still chess, it must deduce this on its own.
Would any LLM be able to play at a decent level?
OpenAI might have thought Chess is good to optimize for but it wasn't seen as useful so they dropped it.
This is what people refer to as "lobotomy": AI models are wasting compute on knowing how loud the cicadas are and how wide the green cockroach is when mating.
Good models are about the training data you push into them.
Do we know it's not special-casing chess and instead using a different engine (not an LLM) for playing?
To be clear, this would be an entirely appropriate approach to problem-solving in the real world, it just wouldn't be the LLM that's playing chess.
There are many cases where I think that. Examples:
* Underage drinking. If it's legal for someone to drink, I think it's in general ethical. If it's illegal, I think it's in general unethical.
* Tax avoidance strategies. If the IRS says a strategy is allowed, I think it's ethical. If the IRS says a strategy is not allowed, I think it's unethical.
* Right on red. If the government says right on red is allowed, I think it's ethical. If the government (e.g. NYC) says right on red is not allowed, I think it's unethical.
The VW case was emissions regulations. I think they have an ethical obligation to obey emissions regulations. In the absence of regulations, it's not an obvious ethical problem to prioritize fuel efficiency instead of emissions (that's I believe what VW was doing).
If you think about how our brains handle this data input, they absolutely do not split it up between the letter and the number, although the presence of both the letter and number together would trigger the same 2 tokens, I would think.
Law and ethics are barely related, in practice.
For example, in the vehicle emissions context, it's worth noting that even well before VW was caught, the actions of likely all carmakers affected by the regulations (not necessarily to the same extent) were clearly unethical. The rules had been subject to intense, clearly unethical lobbying for years, and so even the legal lab results bore little resemblance to practical on-the-road results, through systematic (yet legal) abuse. I wouldn't be surprised to learn that even what was measured intentionally diverged from what is harmful, in a profitable way. It's a good thing VW was made an example of - but clearly it's not like that resolved the general problem of harmful vehicle emissions. Optimistically, it might have signaled to the rest of the industry and VW in particular to stretch the rules less in the future.
These LLMs just exhibited agency.
Swallow your pride.
Since LLMs know people knock off/test/run afoul/make mistakes, it would then raise that as a possibility and likely inquire.
That should give patterns (hence your use of the verb to "spot" them, as the grandmaster would indeed spot the patterns) recognizable in the game string.
More specifically grammar-like patterns, e.g. the same moves but translated.
Typically what an LLM can excel at.
Most likely because they want people to think the system is better than it is for hype purposes.
I should temper my level of being impressed: only if it’s doing this dynamically. Hardcoding recognition of chess moves isn’t exactly a difficult trick to pull given there are like 3 standard formats…
You might consider disregarding the government’s preventative measures unethical, and doing those things might be the way someone disregards the governments protective guidelines, but that doesn’t make those actions unethical any more than governments explicitly legalizing something makes it ethical.
To use a clearer example, the ethicality of abortion— regardless of what you think of it— is not changed by its legal status. You might consider violating the law unethical, so breaking abortion laws would constitute the same ethical violation as underage drinking, but those laws don’t change the ethics of abortion itself. People who consider it unethical still consider it unethical where it’s legal, and those that consider it ethical still consider it ethical where it’s not legal.
So... unless I'm understanding something incorrectly, something like "the three last moves plus 17 bits of state" (plus the current board state) should be enough to treat chess as a memoryless process. Doesn't seem like too much to track.
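For anyone wondering what that extra state actually is, a FEN string carries exactly these fields, and python-chess exposes them directly (a hedged illustration, not tied to any particular engine):

    import chess

    board = chess.Board()
    for san in ["e4", "e5", "Nf3"]:
        board.push_san(san)

    print(board.fen())              # piece placement plus the fields below
    print(board.turn)               # side to move
    print(board.castling_rights)    # bitmask of remaining castling rights
    print(board.ep_square)          # en passant target square, if any
    print(board.halfmove_clock)     # counter toward the 50-move rule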
It used to be common for graphics drivers to outright cheat on benchmarks (the actual image produced would not be the same as it would have been if a benchmark had not been detected); this was arguably, fraud.
It used to be common for mobile phone manufacturers to allow the SoC to operate in a thermal mode that was never available to real users when it detected a benchmark was being used. This is still, IMO, kinda fraud-y.
Optimisation for common benchmark cases where the thing still actually _works_, and where the optimisation is available to normal users where applicable, is less egregious, though, still, IMO, Not Great.
Again, this isn't exactly HAL playing chess.
I say this because the test still uses the same hardware (model) but changed the way it behaved by running emissions-friendly parameters (a different execution framework) that wouldn’t have been used in everyday driving, where fuel-efficiency- and performance-optimized parameters were used instead.
What I’d like to know is if it actually was unethical or not. The overall carbon footprint of the lower fuel consumption setting, with fuel manufacturing and distribution factored in, might easily have been more impactful than the emissions model, which typically does not factor in fuel consumed.
We build layered, non-nested gestalts out of real time analog inputs. As a small example, the meaning of a sentence said with the same precise rhythm and intonation can be meaningfully changed by a gesture made while saying it. That can't be tokenized, and that isn't what's happening.
This means you do need to store the last 50 board positions in the worst case. Normally you need to store less because many moves are irreversible (pawns cannot go backwards, pieces cannot be un-captured).
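python-chess keeps the move stack around for exactly this reason, so repetition and 50-move claims can be checked directly; a toy example shuffling knights back and forth:

    import chess

    board = chess.Board()
    for san in ["Nf3", "Nf6", "Ng1", "Ng8"] * 2:
        board.push_san(san)

    print(board.can_claim_threefold_repetition())  # True: the start position has now occurred 3 times
    print(board.can_claim_fifty_moves())           # False: the halfmove clock is still low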
If the rules themselves are bad and go against deeper morality, then it's a different situation; violating laws out of civil disobedience, emergent need, or with a principled stance is different from wanton, arbitrary, selfish cheating.
If a law is particularly unjust, violating the law might itself be virtuous. If the law is adequate and sensible, violating it is usually wrong even if the violating action could be legal in another sensible jurisdiction.
I agree with GP that if a 'fine tuning' of GPT 3.5 came out the gate playing at top Stockfish level, people would have been extremely suspicious of that. So in my accounting of the unknowns here, the fact that it doesn't play at the top level provides no additional information with which to resolve the question.
But still, yes, something maybe a teeny tiny bit weird is going on, in the sense that only one of the LLMs could beat it. The arxiv paper that came out recently was much more "weird" and interesting than this, though. This will probably be met with a mundane explanation soon enough, I'd guess.
If England had been in the Chinese sphere of influence rather than the Roman one, English would presumably be written with Chinese characters too. The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.
Tell me you haven't been following this field without telling me you haven't been following this field[0][1][2]?
[0]: https://github.com/openai/gym
[1]: https://openai.com/index/jukebox/
[2]: https://openai.com/index/openai-five-defeats-dota-2-world-ch...
>'thinking' vs 'just recombinating things'
If there is a difference, and LLMs can do one but not the other...
>By that standard (and it is a good standard), none of these "AI" things are doing any thinking
>"Does it generalize past the training data" has been a pre-registered goalpost since before the attention transformer architecture came on the scene.
Then what the fuck are they doing? Learning is thinking, reasoning, what have you.
Move goalposts, re-define words, it won't matter.
https://lichess.org/BRceyegK -- the game, you'll see it make the ultimate classic opening errors
https://lichess.org/ -- try yourself! It's really so bad, it's good fun. Click "play with computer" on the right, then level 1 is already selected, you hit go
How do you know it didn't just write a script that uses a chess engine and then execute the script? That IMO is the easiest explanation.
Also, I looked at the gpt-3.5-turbo-instruct example victory. One side played with 70% accuracy and the other was 77%. IMO that's not on par with 27XX ELO.
That is, sometimes, sufficient.
If government says ‘seller of a house must disclose issues’ then I rely on the law being followed; if you sell and leave the country, you have defrauded me.
However if I live in a ‘buyer beware’ jurisdiction, then I know I cannot trust the seller and I hire a surveyor and take insurance.
There is a degree of setting expectations: if there is a rule, even if it’s a terrible rule, I as an individual can at least take some countermeasures.
You can’t take countermeasures against all forms of illegal behaviour, because there is an infinite number of them. And a truly insane person is not predictable at all.
- "...for the closed (OpenAI) models I tried generating up to 10 times and if it still couldn’t come up with a legal move, I just chose one randomly."
- "I ran all the open models (anything not from OpenAI, meaning anything that doesn’t start with gpt or o1) myself using Q5_K_M quantization"
- "...if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1 e4 e5 2.” (without a space)"
- "I used a temperature of 0.7 for all the open models and the default for the closed (OpenAI) models."
Between the tokenizer weirdness, temperature, quantization, random moves, and the chess prompt, there's a lot going on here. I'm unsure how to interpret the results. Fascinating article though!
I read this as that this outlier version is connecting to an engine, and that this engine happens to get parameterized for a not particularly deep search depth.
If it's an exercise in integration they don't need to waste cycles on the engine playing awesome - it's enough for validation if the integration result is noticeably less bad than the LLM alone rambling about trying to sound like a chess expert.
Out of curiosity, what are these companies? And where do they operate.
I'm always interested in these sorts of "hidden" industries. See also: outsourced Facebook content moderation in Kenya.
“Production” is a factory line producing cars. The software is uploaded to the ECUs by some factory machine automatically. Each car is exactly the same, with the exact same software version on thousands and thousands of cars. The cars are sold to customers.
Some small number of these production cars are sent for regulatory compliance checks to third parties. But those cars won’t suddenly become non-production cars just because someone sticks a probe into their exhausts. The same way gmail’s production servers don’t suddenly turn into test environments just because a user opens the network tab in their browser’s dev tool to see what kind of requests fly on the wire.
Chess engines essentially do two things: calculate the value of a given position for their side, and walk the game tree while evaluating its positions in that way.
Historically, position value was a handcrafted function using win/lose criteria (e.g. being able to give checkmate is infinitely good) and elaborate heuristics informed by real chess games, e.g. having more space on the board is good, having a high-value piece threatened by a low-value one is bad etc., and the strength of engines largely resulted from being able to "search the game tree" for good positions very broadly and deeply.
Recently, neural networks (trained on many simulated games) have been replacing these hand-crafted position evaluation functions, but there's still a ton of search going on. In other words, the networks are still largely "dumb but fast", and without deep search they'll lose against even a novice player.
This paper now presents a searchless chess engine, i.e. one that essentially "looks at the board once" and "intuits the best next move", without "calculating" resulting hypothetical positions at all. In the words of Capablanca, a chess world champion also cited in the paper: "I see only one move ahead, but it is always the correct one."
The fact that this is possible can be considered surprising, a testament to the power of transformers etc., but it does indeed have nothing to do with language or LLMs (other than that the best ones known to date are based on the same architecture).
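To make the two ingredients concrete, here is a toy version of the classical recipe: a material-only evaluation plus a shallow negamax search, using python-chess. Real engines are incomparably stronger; this only shows the structure that the paper's searchless model does away with:

    import chess

    VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
              chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    def evaluate(board: chess.Board) -> int:
        # Material balance from the point of view of the side to move
        # (ignores checkmate, mobility, king safety, ... on purpose).
        return sum(VALUES[p.piece_type] * (1 if p.color == board.turn else -1)
                   for p in board.piece_map().values())

    def negamax(board: chess.Board, depth: int) -> float:
        if depth == 0 or board.is_game_over():
            return evaluate(board)
        best = -float("inf")
        for move in list(board.legal_moves):
            board.push(move)
            best = max(best, -negamax(board, depth - 1))
            board.pop()
        return best

    board = chess.Board()
    scores = {}
    for move in list(board.legal_moves):
        san = board.san(move)
        board.push(move)
        scores[san] = -negamax(board, 2)
        board.pop()
    print(max(scores, key=scores.get))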
> That can't be tokenized
Oh ye of little imagination.
If I'm an undergrad doing a math assignment and want to check an answer, I may have no idea that symbolic algebra tools exist or how to use them. But if an all-purpose LLM gets a screenshot of a math equation and knows that its best option is to pass it along to one of those tools, that's valuable to me even if it isn't valuable to a mathematician who would have just cut out of the LLM middle-man and gone straight to the solver.
There are probably a billion examples like this. I'd imagine lots of people are clueless that software exists which can help them with some problem they have, so an LLM would be helpful for discovery even if it's just acting as a pass-through.
What would "winning" or "losing" even mean if all of this was against a single move?
But AFAIK there's no evidence that this actually improves anything, and if you spend that much of the dictionary on one language, it comes at the cost of making the encoding for everything else much less efficient.
But the residents of England do in fact speak English, and English is a phonetic language, so there's an inherent impedance mismatch between Chinese characters and English language. I can make up words in English and write them down which don't necessarily have Chinese written equivalents (and probably, vice-versa?).
This paper posits that if the authors intuition was true then they would find certain empirical results. ie. "If A then B." Then they test and find the empirical results. But this does not imply that their intuition was correct, just as "If A then B" does not imply "If B then A."
If the empirical results were due to tokenization absolutely nothing about this paper would change.
The problem is – in writing Japanese with kanji, lots of somewhat arbitrary decisions had to be made. Which kanji to use for which native Japanese word? There isn't always an obviously best choice from first principles. But that's not a problem in practice, because a tradition developed of which kanji to use for which Japanese word (kun'yomi readings). For English, however, we don't have such a tradition. So it isn't clear which Chinese character to use for each English word. If two people tried to write English with Chinese characters independently, they'd likely make different character choices, and the mutual intelligibility might be poor.
Also, while neither Japanese nor Korean belongs to the same language family as Chinese, both borrowed lots of words from Chinese. In Japanese, a lot of use of kanji (especially on'yomi reading) is for borrowings from Chinese. Since English borrowed far fewer terms from Chinese, this other method of "deciding which character(s) to use" – look at the word's Chinese etymology – largely doesn't work for English given very few English words have Chinese etymology.
Finally, they also invented kanji in Japan for certain Japanese words – kokuji. The same thing happened for Korean Hanja (gukja), to a lesser degree. Vietnamese Chữ Nôm contains thousands of invented-in-Vietnam characters. Probably, if English had adopted Chinese writing, the same would have happened. But again, deciding when to do it and if so how is a somewhat arbitrary choice, which is impossible outside of a real societal tradition of doing it.
> The fact that it used an alphabet instead is a historical accident, not due to any grammatical property of the language.
Using the Latin alphabet changed English, just as using Chinese characters changed Japanese, Korean and Vietnamese. If English had used Chinese characters instead of the Latin alphabet, it would be a very different language today. Possibly not in grammar, but certainly in vocabulary.
Count the number of 3s, only output a single number: 6 5 3 2 8 7 1 3 3 9.
ChatGPT: 3.
That’s not what I mean at all. I mean even if spoken English were exactly the same as it is now, it could have been written with Chinese characters, and indeed would have been if England had been in the Chinese sphere of cultural influence when literacy developed there.
> English is a phonetic language
What does it mean to be a “phonetic language”? In what sense is English “more phonetic” than the Chinese languages?
> I can make up words in English and write them down which don’t necessarily have Chinese written equivalents
Of course. But if English were written with Chinese characters people would eventually agree on characters to write those words with, just like they did with all the native Japanese words that didn’t have Chinese equivalents but are nevertheless written with kanji.
Here is a famous article about how a Chinese-like writing system would work for English: https://www.zompist.com/yingzi/yingzi.htm
And in the API, all of the common features like maths and search are just not there. You can implement them yourself.
You can compare with self hosted models like llama and the performance is quite similar.
You can also jailbreak and get a shell into the container to get some further proof.
I mean it already hands off a wide range of tasks to python… this would be no different.
> Update: OK, I actually think I've figured out what's causing this. I'll explain in a future post, but in the meantime, here's a hint: I think NO ONE has hit on the correct explanation!
well now we are curious!
(Nothing wrong with it! It's just a bit more generic than the original topic.)
https://github.com/adamkarvonen/chess_gpt_eval
Even the blog above says as much.
And the worker AIs "evolve" to meet/exceed expectations only on tasks directly contributing to KPIs the manager AIs measure for - via the mechanism of discarding the "less fit to exceed KPIs".
And some of the worker AIs that are trained on the recent/polluted internet happen to spit out prompt injection attacks that work against the manager AIs' rank-stacking metrics and dominate over "less fit" worker AIs. (Congratulations, we've evolved AI cancer!) These manager AIs start performing spectacularly badly compared to other non-cancerous manager AIs, and die or get killed off by the VCs paying for their datacenters.
Competing manager AIs get training, perhaps on newer HN posts discussing this emergent behavior of worker AIs, and start to down-rank any exceptionally performing worker AIs. The overall trend towards mediocrity becomes inevitable.
Some greybeard writes some Perl and regexes that outcompete commercial manager AIs on pretty much every real world task, while running on a 10 year old laptop instead of a cluster of nuclear-powered AI datacenters all consuming a city's worth of fresh drinking water.
Nobody in powerful positions cares. Humanity dies.
Whether a chess move is legal isn't primarily a question of grammar. It's a question of the board state. "White king to a5" is a perfectly legal move, as long as the white king was next to a5 before the move, and it's white's turn, and there isn't a white piece in a5, and a5 isn't threatened by black. Otherwise it isn't.
"White king to a9" is a move that could be recognized and blocked by a grammar, but how relevant is that?
Worth noting that cost as a function of the characters-per-token ratio is probably quadratic or cubic or some other polynomial, since transformer compute grows super-linearly (attention roughly quadratically) with sequence length, and the characters-per-token ratio is what sets that length. So the difference in computational difficulty is probably huge when compared to a character per token.
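A rough back-of-envelope illustration, assuming attention cost grows with the square of sequence length and a typical ~4 characters per BPE token (both assumptions, just to put numbers on it):

    chars = 1000
    chars_per_token = 4                      # assumed BPE average
    tokens_bpe = chars / chars_per_token     # 250 tokens
    tokens_char = chars                      # 1000 tokens at one character per token
    print((tokens_char / tokens_bpe) ** 2)   # ~16x more attention compute for the same text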
Here, basically you would like the "best" or most probable answer. With 0.7 you ask the LLM to be more creative, meaning it randomly picks among less probable moves. That said, this temperature is still a bit lower than what is commonly used for chat assistants (around 0.8).
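A toy sketch of what the temperature knob actually does to next-token (or next-move) sampling, just to illustrate the point (the logit values are made up):

    import numpy as np

    def sample_probs(logits, temperature):
        # Softmax over logits divided by temperature; lower T sharpens the distribution.
        z = np.array(logits, dtype=float) / max(temperature, 1e-8)
        z -= z.max()                       # numerical stability
        p = np.exp(z)
        return p / p.sum()

    move_logits = [4.0, 3.0, 1.0]          # hypothetical scores for three candidate moves
    print(sample_probs(move_logits, 0.1))  # near-greedy: almost always the top move
    print(sample_probs(move_logits, 0.7))  # the setting used in the post: noticeably more spread
    print(sample_probs(move_logits, 1.5))  # flatter still: weak moves get picked more often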
If it's fine up to a certain depth it's much more likely that it was trained on an opening book imo.
What nobody has bothered to try and explain with this crazy theory is why OpenAI would care to do this, at enormous expense to themselves.
This doesn't require the "highest" settings, it requires any settings whatsoever.
But anyway to spell out some of the huge list of unjustified conditions here:
1. OpenAI spent a lot of time and money R&Ding chess into 3.5-turbo-instruct via external call.
2. They used a terrible chess engine for some reason.
3. They did this deliberately because they didn't want to get "caught" for some reason.
4. They removed this functionality in all other versions of GPT, for some reason ... etc.
Much simpler theory:
1. They used more chess data training that model.
(there are other competing much simpler theories too)
Written English vs written Chinese.
How would you write, in Chinese, the words thingamajibber, gizmosity, or half the things that come out of AvE's mouth? These words have subtle, humorous, and entertaining meanings by way of twisting the sounds of other existing words. Shakespeare was a master of this kind of wordplay and invented a surprising number of words we use today.
I'm not saying you can't have the same phenomenon in spoken Chinese. But how do you write it down without a phonetic alphabet? And if you can't write it down, how do you share it to a wide audience?
Been excited to try this all day, finally got around to it: Llama 3.1 8B did it. It's my app built on llama.cpp, no shenanigans, temp 0, top p 100, 4-bit quantization, model name in screenshot [^1].
I did 7824 to 8948, it protested more for 9954, which made me reconsider whether I'd want to read that many to double check :) and I figured x + 1024 is isomorphic to the original case of you trying on OpenAI and wondering if it wasn't the result of inference.
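If you want to reproduce the same kind of counting test locally, a llama.cpp invocation along these lines should do it (the binary name, model filename, and token limit are assumptions; adjust for your build and quantization):

    ./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
        --temp 0 --top-p 1.0 -n 8192 \
        -p "Count from 7824 to 8948, one number per line."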
My prior was of course it would do this, it's a sequence. I understand e.g. the need for token-healing cases as you correctly note; that could mess things up when there's e.g. notation in an equation that prevents the "correct" digit. I don't see any reason why it'd mess up a sequential list of integers.
In general, as long as it's on topic, I find the handwaving people do about tokenization being a problem to be a bit silly. I'd definitely caution against using the post you linked as a citation; it reads just like a rote repetition of the idea that tokenization causes problems, an idea that spreads like telephone.
It's also a perfect example of the weakness of the genre: just because it sees [5077, 5068, 5938] instead of "strawberry" doesn't mean it can't infer that 5077 = "st" = 0 r's, 5068 = "raw" = 1 r, 5938 = "berry" = 2 r's. In fact, it infers things from broken-up subsequences all the time -- it's how it works! If doing single-character tokenization got free math / counting reliability, we'd very quickly switch to it.
(not saying you're advocating for the argument or you're misinformed, just, speaking colloquially like I would with a friend over a beer)
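For the curious, you can see exactly what the model conditions on with a couple of lines of tiktoken (the encoding name is my choice, and the resulting IDs and splits are whatever your install produces, not necessarily the numbers quoted above):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                                # a few subword IDs
    print([enc.decode([i]) for i in ids])     # the subword pieces the model actually sees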
With Chinese characters, of course. Why wouldn’t you be able to?
In English “thing”, “a”, and “ma” are already words, and “jibber” would presumably be the first character in “gibberish”. So you could write that made-up word by combining those four characters.
> But how do you write it down without a phonetic alphabet?
In general to write a newly coined word you would repurpose characters that sound the same as the newly coined word.
Every syllable that can possibly be uttered according to mandarin phonology is represented by some character (usually many), so this is always possible.
---
Regardless, to reiterate the original point: I'm not claiming Chinese characters are better or more flexible than alphabetic writing. They're not. I'm simply claiming that there's no inherent property of Japanese that makes it more amenable to representation with Chinese characters than English is (other than the fact that a lot of its vocabulary comes from Chinese, but that's not a real counterpoint given that there is lots of native, non-Chinese-derived vocabulary that's still written with kanji).
It would be possible to write Japanese entirely in the Latin alphabet, or English entirely with some system similar to Chinese characters, with minimal to no change to the structure of the language.
If a human said they could code, you don’t expect them to somehow turn into a Python interpreter and execute it in their brain. If a human said they could play chess, I’d raise an eyebrow if they just played the moves Stockfish gave them against me.
C: 唐納·川普, "Thangnar Changpooh"
J: ドナルド・トランプ, "Donaludo Toranpu"
K: 도널드 트럼프, "D'neldeh Tlempeuh"
> What does it mean to be a “phonetic language”?
It means the script is intended to record pronunciation rather than meaning: e.g. it's easy to see how "cow" is intended to be pronounced, but it's not necessarily clear what a cow is. An ideographic script, on the other hand, focuses on meaning: e.g. "魚" is supposed to look like a fish, but its pronunciation varies across "yueh", "sakana", "awe", etc.
1: I tried looking up other notable figures, but thought that this person's entertainment background tends to illustrate the point more clearly.
What? No. Nothing but IPA (and even that only technically) and a language's own native writing works for recording pronunciation. Hiragana, Hangul, and Chữ Quốc Ngữ would not exist otherwise.
e: would _not_ exist
And since the architecture is built on idea-like elements (tokens), I just casually thought there could be limits to how far it can be pushed toward perfecting English, and that some radical change – not simply dropping tokenization, but something more fundamental – has to take place at some point.
What if I make sure to have a drink once a week for the summer with my 18 year old before they go to college because I want them to understand what it's like before they go binge with friends? Is that not ethical?
Speeding to the hospital in an emergency? Lying to Nazis to save a Jew?
Law and ethics are more correlated than some are saying here, but the map is not the territory, and it never will be.
Yeah, that thought crossed my mind as well. I dismissed that thought on the assumption that the measurements in the blog post weren't done from openings but from later stage game states, but I did not verify that assumption, I might have been wrong.
As for the insignificance of game cycles vs LLM cycles, sure. But if it's an integration experiment, they might buy the chess API from some external service with a big disconnect between prices and cycle cost, or host one separately where they simply did not feel any need to bother with a scaling mechanism if they could make it good enough for detection by calling with low depth parameters.
And the last uncertainty, where I'm much further out of my knowledge: we don't know how many calls to the engine a single prompt might cause. Who knows how many cycles of "inner dialogue" refinement might run for a single prompt, and how often the chess engine might get consulted for prompts that aren't really related to chess before the guessing machine finally rejects that possibility. The amount of chess engine calls might be massive, big enough to make cycles per call a meaningful factor again.
I can’t imagine any programmer in this thread would be entertaining a more complicated scenario than this. You can substitute chess for any formal system that has a reliable oracle.
That conspiracy theory has no traction in reality. This blog post is so far the only reference to using LLMs to play chess. The "closed-source" model (whatever that is) is an older version that does worse than the newer version. If your conspiracy theory had any bearing on reality, how come this fictional "real chess engine" was only used in a single release? Unbelievable.
Back in reality, it is well known that newer models made available to the public are adapted to business needs by constraining their capabilities to limit liability.
For a variety of boring reasons, I'm nearly convinced that what they did was either, as you say, train heavily on chess texts, or a plausible variation of using mixture-of-experts and having one of them be an LLM chess savant.
Most of the sources I can find on the Elo of Stockfish at the lowest setting are around 1350, so that part also contributes no weight to the odds, because it's trivially possible to field a weak chess engine.
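For what it's worth, fielding a deliberately weakened Stockfish is nearly a one-liner with python-chess (the option names are standard Stockfish UCI options; the binary path and Elo value are assumptions):

    import chess
    import chess.engine

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    # Clamp Stockfish to roughly club-player strength; recent builds floor out in the low 1300s.
    engine.configure({"UCI_LimitStrength": True, "UCI_Elo": 1350})
    result = engine.play(chess.Board(), chess.engine.Limit(time=0.1))
    print(result.move)
    engine.quit()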
The distinction between prior and posterior odds is critical here. Given a decision to cheat (which I believe is counterfactual on priors), all of the things you're trying to Occam's Razor here are trivially easy to do.
So the only interesting considerations are the ones which factor into the likelihood of them deciding to cheat. If you even want to call it that, shelling out to a chess engine is defensible, although the stochastic fault injection (which is five lines of Python) in that explanation of the data does feel like cheating to me.
What I do consider relevant is that, based on what I know of LLMs, intensively training one to emit chess tokens seems almost banal in terms of outcomes. Also, while I don't trust OpenAI company culture much, I do think they're more interested in 'legitimately' weighting their products to pass benchmarks, or just building stuff with LLMs if you prefer.
I actually think their product would benefit from more code which detects "stuff normal programs should be doing" and uses them. There's been somewhat of a trend toward that, which makes the whole chatbot more useful. But I don't think that's what happened with this one edition of GPT 3.5.
Abortion opponents view it as killing an innocent person. So that's unethical regardless of whether it's legal. I'm not contesting in any way that legal things can be unethical. Abortion supporters view it as a human right, and that right is more important than the law.
Right on red, underage drinking, and increasing car emissions aren't human rights. So outside of extenuating circumstances, if they're illegal, I see them as unethical.
Just like kanji are not native to Japanese.
I don't think the government's job is to enforce morality. The government's job is to set up a framework for society to help people get along.
My point though, is that in general, when there's not a right that outweighs the law, it's unethical to break the law.
To personify the LLM way too much:
It sees that a prompt of some kind wants to play chess.
Knowing this, it looks at the bag of “tools” and sees a chess tool. It then generates a response which eventually causes a call to a chess AI (or just a chess program, potentially) which does further processing.
The first LLM acts as a ton of if-then statements, but automatically generated (or discovered by brute force) through training.
You still needed discrete parts for this system. Some communication protocol, an intent detection step, a chess execution step, etc…
I don’t see how that differs from a classic expert system other than the if statement is handled by a statistical model.
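A minimal sketch of that shape (all function names are hypothetical; this is just the "statistical if-statement in front of discrete tools" structure being described, not anyone's actual implementation):

    def detect_intent(prompt: str) -> str:
        # Stand-in for the LLM-as-classifier / intent detection step.
        return "chess" if "chess" in prompt.lower() or "1. e4" in prompt else "general"

    def chess_tool(prompt: str) -> str:
        # Stand-in for a call out to a separate chess engine or chess-specific model.
        return "1... c5"

    def general_llm(prompt: str) -> str:
        # Stand-in for ordinary free-form generation.
        return "Some free-form completion."

    def respond(prompt: str) -> str:
        intent = detect_intent(prompt)      # the statistically learned "if"
        if intent == "chess":
            return chess_tool(prompt)       # chess execution step
        return general_llm(prompt)

    print(respond("Let's play chess. 1. e4"))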
Here's an experiment: give an LLM a balanced middle game board position and ask it "play a new move that a creative grandmaster has discovered, never before played in chess and explain the tactics and strategy behind it". Repeat many times. Now analyse each move in an engine and look at the distribution of moves and responses. Hypothesis: It is going to come up with a bunch of moves all over the ratings map with some sound and some fallacious arguments.
I really don't think there's anything too mysterious going on here. It just synthesizes existing knowledge and gives answers that include big hits, big misses, and everything in between. Creators chip away at the edges to change that distribution, but the fundamental workings don't change.
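A hedged sketch of that experiment, using python-chess with a local Stockfish binary for the scoring step (the position, the engine path, the iteration count, and the ask_llm_for_move stub are all placeholders you'd swap in yourself):

    import chess
    import chess.engine

    # An arbitrary balanced middlegame-ish position (Najdorf after 5...a6), as an example.
    FEN = "rnbqkb1r/1p2pppp/p2p1n2/8/3NP3/2N5/PPP2PPP/R1BQKB1R w KQkq - 0 6"

    def ask_llm_for_move(fen: str) -> str:
        # Stand-in for the actual prompt: "play a new move that a creative grandmaster
        # has discovered, never before played, and explain the tactics and strategy".
        raise NotImplementedError

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    evals = []
    for _ in range(50):
        board = chess.Board(FEN)
        board.push_san(ask_llm_for_move(FEN))   # throws on ill-formed or illegal moves
        info = engine.analyse(board, chess.engine.Limit(depth=18))
        evals.append(info["score"].pov(chess.WHITE).score(mate_score=10000))
    engine.quit()
    print(sorted(evals))                         # the distribution the hypothesis is about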
Nonsense. There is zero chance in hell that if you combine the pictographs for "thing", "a", "ma", and "gibberish", someone reading that is going to reproduce the sound "thingamajibber". It just does not work. The meme does not replicate.
There may be other virtues of pictographic written language, but reproducing sounds is not one of them. And - as any Shakespeare fan will tell you - tweaking the sounds of English cleverly is rather important. If you can't reproduce this behavior, you're losing something in translation. So to speak.
Each Chinese character represents a syllable (in Chinese languages) or a small set of possible sequences of syllables (in Japanese).
And yes, in Chinese languages, new words are created from characters that sound like the parts of the new word, all the time.
We are pawns, hoping to be maybe a Rook to the King by endgame.
Some think we can promote our pawns to Queens to match.
Luckily, the Jester muses!
So there is a natural way to not just use a minimal bit or byte level tokenization, but every tokenization simultaneously: simply define your dataset to be a bunch of datapoints which are 'start-of-data token, then the byte encoding of a datapoint followed by the BPE encoding of that followed by the WordPiece encoding followed by ... until the end-of-data token'.
You need not actually store any of this on disk, you can compute it on the fly. So you can start by training only on the byte encoded parts, and then gradually switch to training only on the BPE indices, and then gradually switch to the WordPiece, and so on over the course of training. At no point do you need to change the tokenization or tokenizer (as far as the AUNN knows) and you can always switch back and forth or introduce new vocabularies on the fly, or whatever you want. (This means you can do many crazy things if you want. You could turn all documents into screenshots or PDFs, and feed in image tokens once in a while. Or why not video narrations? All it does is take up virtual indices, you don't have to ever train on them...)
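A minimal sketch of that layout, under the assumption that each vocabulary is simply shifted into its own ID range so they never collide (the control token IDs, the offsets, and the choice of tiktoken as the stand-in BPE are all mine, not anything specified above):

    import tiktoken

    BOS, SEP, EOS = 0, 1, 2               # hypothetical control tokens
    BYTE_OFFSET = 3                       # raw bytes live in [3, 259)
    BPE_OFFSET = 3 + 256                  # BPE IDs shifted above the byte range

    bpe = tiktoken.get_encoding("cl100k_base")   # stand-in for "the BPE encoding"

    def multi_view(text: str) -> list[int]:
        byte_ids = [BYTE_OFFSET + b for b in text.encode("utf-8")]
        bpe_ids = [BPE_OFFSET + t for t in bpe.encode(text)]
        # A WordPiece view, image tokens, etc. would be appended the same way,
        # each in its own offset range, before the end-of-data token.
        return [BOS] + byte_ids + [SEP] + bpe_ids + [EOS]

    print(multi_view("chess"))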