That's the problem with closed models: we can never know what they're doing.
The only way it could be true is if that model recognized and replayed the answer to the game from memory.
If I tell an "agent", whether human or artificial, to win at chess, it is a good decision for that agent to delegate that task to a system that is good at chess. This would be obvious to a human agent, so presumably it should be obvious to an AI as well.
This isn't useful for AI researchers, I suppose, but it does make the model more useful as a tool.
(This may all be a good thing, as giving AIs true agency seems scary.)
As long as you are training it to make a tool call, you can add and remove anything you want behind the inference endpoint accessible to the public. You can then plug the answer back into the chat AI and pass it through a moderation filter, and you might get good output from it with very little latency added.
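Concretely, that pipeline can be as thin as the sketch below. Every function here (detect_chess_intent, chess_engine_move, llm_reply, moderation_ok) is a hypothetical stand-in, not any vendor's API; the point is only the shape: detect intent, call whatever sits behind the endpoint, feed the answer back into the chat model, and run a moderation pass last.

```python
def detect_chess_intent(message: str) -> bool:
    # Stand-in intent check; in reality a trained model decides this.
    return "chess" in message.lower()

def chess_engine_move(board_fen: str) -> str:
    # Placeholder for whatever engine actually sits behind the endpoint.
    return "e7e5"

def llm_reply(message: str, suggested_move: str | None = None) -> str:
    # Placeholder chat model; a real one would weave the move into prose.
    if suggested_move:
        return f"I'd play {suggested_move} here."
    return "Happy to help."

def moderation_ok(text: str) -> bool:
    # Placeholder moderation filter.
    return True

def handle_request(message: str, board_fen: str = "") -> str:
    if detect_chess_intent(message):
        move = chess_engine_move(board_fen)   # tool call behind the endpoint
        reply = llm_reply(message, move)      # plug the answer back into the chat AI
    else:
        reply = llm_reply(message)
    return reply if moderation_ok(reply) else "[withheld by moderation]"

print(handle_request("Let's play chess, I'll open with e4"))
```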
None of these changes are explained to the LLM, so if it can tell it's still chess, it must deduce this on its own.
Would any LLM be able to play at a decent level?
These LLMs just exhibited agency.
Swallow your pride.
Most likely because they want people to think the system is better than it is for hype purposes.
I should temper how impressed I am: it's only impressive if it's doing this dynamically. Hardcoding recognition of chess moves isn't exactly a difficult trick to pull given there's like 3 standard formats…
> 'thinking' vs 'just recombining things'
If there is a difference, and LLMs can do one but not the other...
> By that standard (and it is a good standard), none of these "AI" things are doing any thinking
> "Does it generalize past the training data" has been a pre-registered goalpost since before the attention transformer architecture came on the scene.
Then what the fuck are they doing? Learning is thinking, reasoning, what have you.
Move goalposts, re-define words, it won't matter.
If I'm an undergrad doing a math assignment and want to check an answer, I may have no idea that symbolic algebra tools exist or how to use them. But if an all-purpose LLM gets a screenshot of a math equation and knows that its best option is to pass it along to one of those tools, that's valuable to me even if it isn't valuable to a mathematician who would have just cut out the LLM middle-man and gone straight to the solver. (A rough sketch of that hand-off follows below.)
There are probably a billion examples like this. I'd imagine lots of people are clueless that software exists which can help them with some problem they have, so an LLM would be helpful for discovery even if it's just acting as a pass-through.
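For a concrete picture of the pass-through, here's a toy version where the "LLM" step is reduced to assuming the equation has already been read out of the screenshot as plain text, and SymPy stands in for the symbolic algebra tool; everything except the SymPy calls is a hypothetical stand-in.

```python
import sympy as sp

def solve_equation(equation_text: str):
    # Pretend the LLM has already read the screenshot and emitted the
    # equation as text, e.g. "x**2 - 5*x + 6 = 0".
    lhs_text, rhs_text = equation_text.split("=")
    x = sp.symbols("x")
    equation = sp.Eq(sp.sympify(lhs_text), sp.sympify(rhs_text))
    return sp.solve(equation, x)

print(solve_equation("x**2 - 5*x + 6 = 0"))  # -> [2, 3]
```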
I mean it already hands off a wide range of tasks to Python… this would be no different.
And the worker AIs "evolve" to meet/exceed expectations only on tasks directly contributing to the KPIs the manager AIs measure, via the mechanism of discarding those "less fit to exceed KPIs".
And some of the worker AIs who are trained on the recent, polluted internet happen to spit out prompt injection attacks that work against the manager AIs' rank-stacking metrics and dominate over "less fit" worker AIs. (Congratulations, we've evolved AI cancer!) These manager AIs start performing spectacularly badly compared to other non-cancerous manager AIs, and die or get killed off by the VCs paying for their datacenters.
Competing manager AIs get retrained, perhaps on newer HN posts discussing this emergent behavior of worker AIs, and start to down-rank any exceptionally performing worker AIs. The overall trend towards mediocrity becomes inevitable.
Some greybeard writes some Perl and regexes that outcompete commercial manager AIs on pretty much every real-world task, while running on a 10-year-old laptop instead of a cluster of nuclear-powered AI datacenters all consuming a city's worth of fresh drinking water.
Nobody in powerful positions cares. Humanity dies.
If a human said they could code, you wouldn't expect them to somehow turn into a Python interpreter and execute it in their brain. If a human said they could play chess, I'd raise an eyebrow if they just played the moves Stockfish gave them against me.
To personify the LLM way too much:
It sees that a prompt of some kind wants to play chess.
Knowing this, it looks at its bag of "tools" and sees a chess tool. It then generates a response which eventually causes a call to a chess AI (or just a chess program, potentially) that does further processing.
The first LLM acts as a ton of if-then statements, but automatically generated (or discovered by brute force) through training.
You still needed discrete parts for this system: some communication protocol, an intent detection step, a chess execution step, etc…
I don't see how that differs from a classic expert system, other than that the if statement is handled by a statistical model.
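To make the comparison concrete, here's a toy router; it's purely illustrative and every name in it is hypothetical. In a classic expert system the branch is a hand-written rule, while in the tool-calling setup the branch is a statistical model's prediction, and the surrounding plumbing (protocol, dispatch, execution) is the same either way.

```python
import random

def route_rules(message: str) -> str:
    # Classic expert system: the branch is a hand-written rule.
    return "chess_engine" if "chess" in message.lower() else "chat"

def route_statistical(message: str) -> str:
    # Tool-calling LLM, caricatured: the branch is a model's prediction.
    # Here a coin flip stands in for the trained classifier.
    return random.choice(["chess_engine", "chat"])

TOOLS = {
    "chess_engine": lambda msg: "engine says: 1. e4",
    "chat": lambda msg: "plain chat reply",
}

def dispatch(message: str, router) -> str:
    # Communication protocol + intent detection + execution step,
    # shared by both designs; only the router differs.
    return TOOLS[router(message)](message)

print(dispatch("Fancy a game of chess?", route_rules))
```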
We are pawns, hoping to be maybe a Rook to the King by endgame.
Some think we can promote our pawns to Queens to match.
Luckily, the Jester muses!