I read this as: this outlier version is connecting to an engine, and that engine happens to be parameterized for a not particularly deep search depth.
If it's an exercise in integration, they don't need to waste cycles on making the engine play brilliantly - for validation it's enough that the integrated result is noticeably less bad than the LLM alone rambling away while trying to sound like a chess expert.
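Just to make concrete how cheap "not particularly deep" could be - a minimal sketch using python-chess and a local Stockfish binary (both my assumptions, nothing from the post): capping the search is literally one parameter.

    import chess
    import chess.engine

    # Hypothetical: a shallow, cheap engine call - enough to look far better
    # than an LLM's unaided move suggestions, nowhere near full engine strength.
    engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")
    board = chess.Board()
    result = engine.play(board, chess.engine.Limit(depth=4))
    print(result.move)
    engine.quit()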
If it's fine up to a certain depth, it's much more likely that it was trained on an opening book imo.
What nobody has bothered to try and explain with this crazy theory is why OpenAI would care to do this at enormous expense to themselves.
Yeah, that thought crossed my mind as well. I dismissed it on the assumption that the measurements in the blog post weren't done from openings but from later-stage game states, but I didn't verify that assumption, so I might be wrong.
As for the insignificance of engine cycles vs LLM cycles, sure. But if it's an integration experiment, they might buy the chess API from some external service with a big disconnect between price and cycle cost, or host one themselves and simply not bother with any scaling mechanism, since calling it with a low depth parameter would already make it good enough for the validation to detect.
And the last uncertainty, where I'm much further out of my knowledge: we don't know how many calls to the engine a single prompt might cause. Who knows how many cycles of "inner dialogue" refinement might run for a single prompt, and how often the chess engine might get consulted for prompts that aren't really related to chess before the guessing machine finally rejects that possibility. The number of chess engine calls might be massive, big enough to make cycles per call a meaningful factor again.
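To illustrate the multiplication I mean (purely hypothetical - I have no idea how their routing works, and everything in this stub is made up): if every prompt can trigger several refinement passes and each pass may consult the engine, the calls per prompt add up.

    import chess
    import chess.engine

    # Made-up sketch: one prompt, several "inner dialogue" passes, each pass
    # possibly consulting the engine before the chess hypothesis is dropped.
    def engine_calls_for_prompt(prompt, engine, refinement_passes=5):
        calls = 0
        for _ in range(refinement_passes):
            maybe_chess = "chess" in prompt.lower() or "e4" in prompt
            if not maybe_chess:
                break
            board = chess.Board()  # a real pipeline would reconstruct the position
            engine.play(board, chess.engine.Limit(depth=4))
            calls += 1
        return calls

    engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")
    print(engine_calls_for_prompt("What's the best reply to 1. e4?", engine))
    engine.quit()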