688 points crescit_eundo | 23 comments
1. chvid ◴[] No.42144283[source]
Theory 5: GPT-3.5-instruct plays chess by calling a traditional chess engine.
replies(5): >>42144296 #>>42144326 #>>42144379 #>>42144517 #>>42156924 #
2. kylebenzle ◴[] No.42144296[source]
Yes! I was also waiting for this seemingly obvious answer in the article. Hopefully the author will see these comments.
3. bubblyworld ◴[] No.42144326[source]
Just think about the trade-off from OpenAI's side here - they're going to add a bunch of complexity to gpt3.5 to let it call out to engines (either an external system monitoring all outputs for chess-related stuff, or some kind of tool-assisted CoT, for instance) just so it can play chess incorrectly a high percentage of the time, and even when it doesn't, only at a mere 1800 Elo? In return for some mentions in a few relatively obscure blog posts? Doesn't make any sense to me as an explanation.
replies(2): >>42144427 #>>42144614 #
4. pixiemaster ◴[] No.42144379[source]
I have this hypothesis as well: that OpenAI added a lot of "classic" algorithms and rules over time (e.g. rules for filtering, etc.)
5. copperx ◴[] No.42144427[source]
But there could be a simple explanation. For example, they could have tested many "engines" when developing function calling and just left them in there. They just happened to connect to a basic chess-playing algorithm and nothing sophisticated.

Also, it makes a lot of sense if you expect people to play chess against the LLM, especially if you are later training future models on the chats.

replies(1): >>42144859 #
6. golol ◴[] No.42144517[source]
Sorry, this is just conspiracy theorizing. I've tried it with GPT-3.5-instruct myself in the OpenAI playground, where the model clearly does nothing but auto-regression. No function calling there whatsoever.
7. usrusr ◴[] No.42144614[source]
Could be a pilot implementation to learn how to link up external specialist engines. Chess would be the obvious example to start with because the problem is so well known and standardized, and specialist engines are easily available. If they ever want to offer an integration like that to customers (who might have some existing rule-based engine in house), they need to know everything they can about expected cost and performance.
replies(1): >>42144821 #
8. bubblyworld ◴[] No.42144821{3}[source]
This doesn't address its terrible performance. If it were touching anything like a real engine it would be playing at a superhuman level, not the level of an upper-tier beginner.
replies(2): >>42145541 #>>42148929 #
9. bubblyworld ◴[] No.42144859{3}[source]
This still requires a lot of coincidences, like they chose to use a terrible chess engine for their external tool (why?), they left it on in the background for all calls via all APIs for only gpt-3.5-turbo-instruct (why?), they see business value in this specific model being good at chess vs other things (why?).

You say it makes sense, but how does it make sense for OpenAI to add overhead to all of its API calls for the super-niche case of people playing against an 1800 Elo chess chatbot (that often plays illegal moves - you can go try it yourself)?

10. 9dev ◴[] No.42145541{4}[source]
That would have immediately given away that something must be off. If you want to do this in a subtle way that increases the hype around GPT-3.5 at the time, giving it a good-but-not-too-good rating would be the way to go.
replies(1): >>42147459 #
11. bubblyworld ◴[] No.42147459{5}[source]
If you want to keep adding conditions to an already-complex theory, you'll need an equally complex set of observations to justify it.
replies(1): >>42148203 #
12. samatman ◴[] No.42148203{6}[source]
You're the one imposing an additional criterion, that OpenAI must have chosen the highest setting on a chess engine, and demanding that this additional criterion be used to explain the facts.

I agree with GP that if a 'fine-tuning' of GPT-3.5 came out of the gate playing at top Stockfish level, people would have been extremely suspicious of that. So in my accounting of the unknowns here, the fact that it doesn't play at the top level provides no additional information with which to resolve the question.

replies(5): >>42148525 #>>42148570 #>>42148689 #>>42148759 #>>42154446 #
13. ◴[] No.42148525{7}[source]
14. ◴[] No.42148570{7}[source]
15. ◴[] No.42148689{7}[source]
16. ◴[] No.42148759{7}[source]
17. usrusr ◴[] No.42148929{4}[source]
The way I read the article is that it's just as terrible as you would expect from pure word association, except for one version that's an outlier in not being terrible at all within a well-defined search depth, and just as terrible again beyond that. And only this outlier is the weird thing referenced in the headline.

I read this as meaning that this outlier version is connecting to an engine, and that this engine happens to be parameterized for a fairly shallow search depth.

If it's an exercise in integration, they don't need to waste cycles on making the engine play brilliantly - it's enough for validation if the integrated result is noticeably less bad than the LLM alone rambling and trying to sound like a chess expert.
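
(For illustration only - a minimal sketch of what a deliberately shallow engine call could look like, assuming the python-chess library and a local Stockfish binary; none of this is claimed to be what OpenAI actually runs:)

    import chess
    import chess.engine

    board = chess.Board()  # or a mid-game position loaded from a FEN string
    with chess.engine.SimpleEngine.popen_uci("./stockfish") as engine:
        # Shallow search: cheap, and still "noticeably less bad" than word association.
        weak_move = engine.play(board, chess.engine.Limit(depth=4)).move
    print(board.san(weak_move))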

replies(1): >>42154386 #
18. bubblyworld ◴[] No.42154386{5}[source]
In this hypothetical, the cycles aren't being wasted on the engine, they're being wasted on running a 200b parameter LLM for longer than necessary in order to play chess badly instead of terribly. An engine playing superhuman chess takes a comparatively irrelevant amount of compute these days.

If it's fine up to a certain depth it's much more likely that it was trained on an opening book imo.

What nobody has bothered to try to explain with this crazy theory is why OpenAI would care to do this at enormous expense to themselves.

replies(1): >>42156358 #
19. bubblyworld ◴[] No.42154446{7}[source]
That's not an additional criterion, it's simply the most likely version of this hypothetical - a superhuman engine is much easier to integrate than an 1800 Elo engine that makes invalid moves, for the simple reason that the vast majority of chess engines play at >1800 Elo out of the box and never make invalid moves (they are way past that level on a log scale, actually).

This doesn't require the "highest" settings, it requires any settings whatsoever.

But anyway, to spell out some of the huge list of unjustified conditions here:

1. OpenAI spent a lot of time and money R&Ding chess into 3.5-turbo-instruct via external call.

2. They used a terrible chess engine for some reason.

3. They did this deliberately because they didn't want to get "caught" for some reason.

4. They removed this functionality in all other versions of GPT for some reason... etc.

Much simpler theory:

1. They used more chess data when training that model.

(there are other competing much simpler theories too)

replies(1): >>42157342 #
20. usrusr ◴[] No.42156358{6}[source]
> If it's fine up to a certain depth it's much more likely that it was trained on an opening book imo.

Yeah, that thought crossed my mind as well. I dismissed it on the assumption that the measurements in the blog post weren't done from openings but from later-stage game states, but I did not verify that assumption; I might have been wrong.

As for the insignificance of game cycles vs LLM cycles, sure. But if it's an integration experiment, they might buy the chess API from some external service with a big disconnect between price and cycle cost, or host one separately where they simply didn't feel any need to bother with a scaling mechanism, as long as they could make it good enough for detection by calling it with low depth parameters.

And the last uncertainty - here I'm much further out of my depth: we don't know how many calls to the engine a single prompt might cause. Who knows how many cycles of "inner dialogue" refinement might run for a single prompt, and how often the chess engine might get consulted for prompts that aren't really related to chess before the guessing machine finally rejects that possibility. The number of chess engine calls might be massive, big enough to make cycles per call a meaningful factor again.

21. wibwobble12333 ◴[] No.42156924[source]
Occam’s razor. I could build a good chess-playing wrapper around the OpenAI API (any version) that would consult a chess engine when presented with any board scenario, and introduce some randomness so that it doesn’t play too well (rough sketch below).

I can’t imagine any programmer in this thread would be entertaining a more complicated scenario than this. You can swap chess out for any formal system that has a reliable oracle.
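
A minimal sketch of what such a wrapper might look like (assuming python-chess and a local Stockfish binary; the FEN detection, engine path, and blunder rate are all made up for illustration, not anything OpenAI is known to run):

    import random
    import re

    import chess
    import chess.engine

    ENGINE_PATH = "./stockfish"   # assumed local UCI engine binary
    BLUNDER_RATE = 0.15           # chance of playing a random legal move instead

    # Very rough check for a FEN string somewhere in the prompt.
    FEN_RE = re.compile(r"([rnbqkpRNBQKP1-8]+/){7}[rnbqkpRNBQKP1-8]+ [wb] \S+ \S+ \d+ \d+")

    def answer(prompt, llm_fallback):
        match = FEN_RE.search(prompt)
        if not match:
            return llm_fallback(prompt)            # not chess: let the LLM answer
        board = chess.Board(match.group(0))
        with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
            move = engine.play(board, chess.engine.Limit(time=0.05)).move
        if random.random() < BLUNDER_RATE:         # "introduce some randomness"
            move = random.choice(list(board.legal_moves))
        return board.san(move)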

22. samatman ◴[] No.42157342{8}[source]
My point is that given a prior of 'wired in a chess engine', my posterior odds that they would make it plausibly good and not implausibly good approach one.

For a variety of boring reasons, I'm nearly convinced that what they did was either, as you say, train heavily on chess texts, or a plausible variation of using mixture-of-experts and having one of them be an LLM chess savant.

Most of the sources I can find on the Elo of Stockfish at its lowest setting put it around 1350, so that part also contributes no weight to the odds, because it's trivially possible to field a weak chess engine.
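
For what it's worth, fielding a deliberately weak engine is a couple of lines of configuration (a sketch assuming python-chess and a Stockfish build exposing the standard UCI_LimitStrength/UCI_Elo options):

    import chess
    import chess.engine

    with chess.engine.SimpleEngine.popen_uci("./stockfish") as engine:
        # Standard UCI options for capping playing strength well below full Stockfish.
        engine.configure({"UCI_LimitStrength": True, "UCI_Elo": 1350})
        move = engine.play(chess.Board(), chess.engine.Limit(time=0.1)).move
        print(move)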

The distinction between prior and posterior odds is critical here. Given a decision to cheat (which I believe is counterfactual on priors), all of the things you're trying to Occam's Razor here are trivially easy to do.

So the only interesting considerations are the ones which factor into the likelihood of them deciding to cheat. If you even want to call it that, shelling out to a chess engine is defensible, although the stochastic fault injection (which is five lines of Python) in that explanation of the data does feel like cheating to me.

What I do consider relevant is that, based on what I know of LLMs, intensively training one to emit chess tokens seems almost banal in terms of outcomes. Also, while I don't trust OpenAI company culture much, I do think they're more interested in 'legitimately' weighting their products to pass benchmarks, or just building stuff with LLMs if you prefer.

I actually think their product would benefit from more code which detects "stuff normal programs should be doing" and uses them. There's been somewhat of a trend toward that, which makes the whole chatbot more useful. But I don't think that's what happened with this one edition of GPT 3.5.

replies(1): >>42175648 #
23. ◴[] No.42175648{9}[source]