Something weird is happening with LLMs and chess

(dynomight.substack.com)

696 points crescit_eundo | 3 comments | 14 Nov 24 17:05 UTC

chvid ◴[15 Nov 24 05:55 UTC] No.42144283[source]▶

>>42138289 (OP) #

Theory 5: GPT-3.5-instruct plays chess by calling a traditional chess engine.

replies(5): >>42144296 #>>42144326 #>>42144379 #>>42144517 #>>42156924 #

bubblyworld ◴[15 Nov 24 06:08 UTC] No.42144326[source]▶

>>42144283 #

Just think about the trade-off from OpenAI's side here: they're going to add a bunch of complexity to GPT-3.5 to let it call out to engines (either an external system monitoring all outputs for chess-related stuff, or some kind of tool-assisted CoT, for instance) just so it can play chess incorrectly a high percentage of the time and, even when it doesn't, at a mere 1800 Elo level? In return for some mentions in a few relatively obscure blog posts? Doesn't make any sense to me as an explanation.

replies(2): >>42144427 #>>42144614 #

usrusr ◴[15 Nov 24 07:21 UTC] No.42144614[source]▶

>>42144326 #

Could be a pilot implementation to learn about how to link up external specialist engines. Chess would be the obvious example to start with because the problem is so well known and standardized, and specialist engines are easily available. If they ever want to offer an integration like that to customers (who might have some existing rule-based engine in house), they need to know everything they can about expected cost and performance.

replies(1): >>42144821 #

bubblyworld ◴[15 Nov 24 08:05 UTC] No.42144821[source]▶

>>42144614 #

This doesn't address its terrible performance. If it were touching anything like a real engine it would be playing at a superhuman level, not the level of an upper-tier beginner.

replies(2): >>42145541 #>>42148929 #

9dev ◴[15 Nov 24 10:21 UTC] No.42145541[source]▶

>>42144821 #

That would have immediately given away that something must be off. If you want to do this in a subtle way that increases the hype around GPT-3.5 at the time, giving it a good-but-not-too-good rating would be the way to go.

replies(1): >>42147459 #

bubblyworld ◴[15 Nov 24 14:56 UTC] No.42147459[source]▶

>>42145541 #

If you want to keep adding conditions to an already-complex theory, you'll need an equally complex set of observations to justify it.

replies(1): >>42148203 #

samatman ◴[15 Nov 24 16:12 UTC] No.42148203[source]▶

>>42147459 #

You're the one imposing an additional criterion, that OpenAI must have chosen the highest setting on a chess engine, and demanding that this additional criterion be used to explain the facts.

I agree with GP that if a 'fine tuning' of GPT 3.5 came out the gate playing at top Stockfish level, people would have been extremely suspicious of that. So in my accounting of the unknowns here, the fact that it doesn't play at the top level provides no additional information with which to resolve the question.

replies(5): >>42148525 #>>42148570 #>>42148689 #>>42148759 #>>42154446 #

1. bubblyworld ◴[16 Nov 24 05:05 UTC] No.42154446[source]▶

>>42148203 #

That's not an additional criterion, it's simply the most likely version of this hypothetical: a superhuman engine is much easier to integrate than an 1800 Elo engine that makes invalid moves, for the simple reason that the vast majority of chess engines play at >1800 Elo out of the box and don't make invalid moves ever (they are way past that level on a log scale, actually).

This doesn't require the "highest" settings, it requires any settings whatsoever.
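To put the log-scale remark in numbers, here is a minimal sketch of the standard Elo expected-score formula. The 1800 figure is from the thread; the 3000 rating is only an assumed stand-in for a strong engine at default settings, not a measured value.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for player A (win = 1, draw = 0.5) under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Assumed ratings, for illustration only: an 1800-rated player against a
# hypothetical 3000-rated engine. A 1200-point gap is a factor of 10**3 in odds.
print(f"{expected_score(1800, 3000):.4f}")  # prints 0.0010
```

Every 400 points of rating difference multiplies the odds by ten, so under this assumption the 1800-level player is expected to score about one point per thousand games.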

But anyway to spell out some of the huge list of unjustified conditions here:

1. OpenAI spent a lot of time and money R&Ding chess into 3.5-turbo-instruct via external call.

2. They used a terrible chess engine for some reason.

3. They did this deliberately because they didn't want to get "caught" for some reason.

4. They removed this functionality in all other versions of gpt for some reason ...etc

Much simpler theory:

1. They used more chess data training that model.

(there are other competing much simpler theories too)

replies(1): >>42157342 #

2. samatman ◴[16 Nov 24 16:49 UTC] No.42157342[source]▶

>>42154446 (TP) #

My point is that given a prior of 'wired in a chess engine', my posterior odds that they would make it plausibly-good and not implausibly-good approaches one.

For a variety of boring reasons, I'm nearly convinced that what they did was either, as you say, train heavily on chess texts, or a plausible variation of using mixture-of-experts and having one of them be an LLM chess savant.

Most of the sources I can find on the Elo rating of Stockfish at the lowest setting are around 1350, so that part also contributes no weight to the odds, because it's trivially possible to field a weak chess engine.

The distinction between prior and posterior odds is critical here. Given a decision to cheat (which I believe is counterfactual on priors), all of the things you're trying to Occam's Razor here are trivially easy to do.
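One way to write that distinction down is the odds form of Bayes' rule. The symbols here are labels chosen for illustration, not anything from the thread: $E$ is "a chess engine is wired in", $T$ is "the model was trained heavily on chess data", and $D$ is the observed play (roughly 1800 Elo with occasional illegal moves).

```latex
\underbrace{\frac{P(E \mid D)}{P(T \mid D)}}_{\text{posterior odds}}
  \;=\;
\underbrace{\frac{P(E)}{P(T)}}_{\text{prior odds}}
  \times
\underbrace{\frac{P(D \mid E)}{P(D \mid T)}}_{\text{likelihood ratio}}
```

Read this way, the claim is that $P(D \mid E)$ is close to one, so the likelihood ratio cannot do much to penalize $E$, and whatever makes the engine theory unlikely has to come from the prior odds.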

So the only interesting considerations are the ones which factor into the likelihood of them deciding to cheat, if you even want to call it that: shelling out to a chess engine is defensible, although the stochastic fault injection (which is five lines of Python) in that explanation of the data does feel like cheating to me.
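For a sense of how little code "stochastic fault injection" would take, here is a hypothetical sketch. Nothing in the thread suggests anyone actually wrote this; the function name, the rates, and the representation of moves as plain strings are all assumptions made for illustration.

```python
import random
from typing import Optional, Sequence


def noisy_move(engine_move: str,
               legal_moves: Sequence[str],
               candidate_moves: Sequence[str],
               blunder_rate: float = 0.10,
               illegal_rate: float = 0.02,
               rng: Optional[random.Random] = None) -> str:
    """Return the engine's move, occasionally swapped for a worse one.

    All parameters are hypothetical: `candidate_moves` is any pool of move
    strings that may include moves not legal in the current position.
    """
    rng = rng or random.Random()
    roll = rng.random()  # uniform in [0, 1)
    if roll < illegal_rate:
        illegal = [m for m in candidate_moves if m not in legal_moves]
        if illegal:
            return rng.choice(illegal)  # inject an invalid move
    if roll < illegal_rate + blunder_rate:
        return rng.choice(list(legal_moves))  # inject a random legal move
    return engine_move  # otherwise pass the engine's choice through


# Example with made-up moves: the engine's pick survives most of the time.
print(noisy_move("e2e4", ["e2e4", "d2d4", "g1f3"], ["e2e4", "e2e5", "a1a8"]))
```

The core really is about five lines; the rest is signature and comments, which is the point being made about how cheap this would be to do.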

What I do consider relevant is that, based on what I know of LLMs, intensively training one to emit chess tokens seems almost banal in terms of outcomes. Also, while I don't trust OpenAI company culture much, I do think they're more interested in 'legitimately' weighting their products to pass benchmarks, or just building stuff with LLMs if you prefer.

I actually think their product would benefit from more code which detects "stuff normal programs should be doing" and uses them. There's been somewhat of a trend toward that, which makes the whole chatbot more useful. But I don't think that's what happened with this one edition of GPT 3.5.

replies(1): >>42175648 #

3. ◴[18 Nov 24 18:56 UTC] No.42175648[source]▶

>>42157342 #
