
695 points crescit_eundo | 9 comments
chvid ◴[] No.42144283[source]
Theory 5: GPT-3.5-instruct plays chess by calling a traditional chess engine.
replies(5): >>42144296 #>>42144326 #>>42144379 #>>42144517 #>>42156924 #
bubblyworld ◴[] No.42144326[source]
Just think about the trade-off from OpenAI's side here: they're going to add a bunch of complexity to GPT-3.5 to let it call out to engines (either an external system monitoring all outputs for chess-related content, or some kind of tool-assisted CoT, for instance) just so it can play chess incorrectly a high percentage of the time and, even when it doesn't, at a mere 1800 Elo? In return for some mentions in a few relatively obscure blog posts? That doesn't make any sense to me as an explanation.
replies(2): >>42144427 #>>42144614 #
usrusr ◴[] No.42144614[source]
Could be a pilot implementation to learn how to link up external specialist engines. Chess would be the obvious example to start with because the problem is well known and standardized, and specialist engines are easily available. If they ever want to offer an integration like that to customers (who might have some existing rule-based engine in house), they need to know everything they can about expected cost and performance.
replies(1): >>42144821 #
bubblyworld ◴[] No.42144821[source]
This doesn't address its terrible performance. If it were touching anything like a real engine, it would be playing at a superhuman level, not at the level of an upper-tier beginner.
replies(2): >>42145541 #>>42148929 #
9dev ◴[] No.42145541[source]
That would have immediately given away that something was off. If you wanted to do this in a subtle way that increased the hype around GPT-3.5 at the time, giving it a good-but-not-too-good rating would be the way to go.
replies(1): >>42147459 #
1. bubblyworld ◴[] No.42147459[source]
If you want to keep adding conditions to an already-complex theory, you'll need an equally complex set of observations to justify it.
replies(1): >>42148203 #
2. samatman ◴[] No.42148203[source]
You're the one imposing an additional criterion, that OpenAI must have chosen the highest setting on a chess engine, and demanding that this additional criterion be used to explain the facts.

I agree with GP that if a 'fine-tuning' of GPT-3.5 came out of the gate playing at top Stockfish level, people would have been extremely suspicious of that. So in my accounting of the unknowns here, the fact that it doesn't play at the top level provides no additional information with which to resolve the question.

replies(5): >>42148525 #>>42148570 #>>42148689 #>>42148759 #>>42154446 #
7. bubblyworld ◴[] No.42154446[source]
That's not an additional criterion; it's simply the most likely version of this hypothetical. A superhuman engine is much easier to integrate than an 1800 Elo engine that makes invalid moves, for the simple reason that the vast majority of chess engines play at well above 1800 Elo out of the box and never make invalid moves (they are way past that level on a log scale, actually).

This doesn't require the "highest" settings; it requires any settings whatsoever.

But anyway, to spell out some of the huge list of unjustified conditions here:

1. OpenAI spent a lot of time and money R&Ding chess into 3.5-turbo-instruct via an external call.

2. They used a terrible chess engine for some reason.

3. They did this deliberately because they didn't want to get "caught" for some reason.

4. They removed this functionality in all other versions of GPT, for some reason... etc.

Much simpler theory:

1. They used more chess data when training that model.

(there are other competing, much simpler theories too)

replies(1): >>42157342 #
8. samatman ◴[] No.42157342{3}[source]
My point is that, given a prior of 'wired in a chess engine', my posterior odds that they would make it plausibly good rather than implausibly good approach one.

For a variety of boring reasons, I'm nearly convinced that what they did was either, as you say, train heavily on chess texts, or, in a plausible variation, use a mixture-of-experts model and have one of the experts be an LLM chess savant.

Most of the sources I can find put the Elo of Stockfish at its lowest setting around 1350, so that part also contributes no weight to the odds, because it's trivially possible to field a weak chess engine.
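For concreteness, here's a minimal sketch of what fielding a deliberately weak engine looks like, assuming a local Stockfish binary and the python-chess package (the path, option names, and Elo value are illustrative; the exact floor varies by Stockfish version):

    import chess
    import chess.engine

    # Illustrative setup only: spawn a local Stockfish and cap its strength.
    engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")
    engine.configure({"UCI_LimitStrength": True, "UCI_Elo": 1350})

    board = chess.Board()
    result = engine.play(board, chess.engine.Limit(time=0.1))
    print(result.move)
    engine.quit()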

The distinction between prior and posterior odds is critical here. Given a decision to cheat (which I believe is counterfactual on priors), all of the things you're trying to Occam's Razor here are trivially easy to do.

So the only interesting considerations are the ones that factor into the likelihood of them deciding to cheat. If you even want to call it that: shelling out to a chess engine is defensible, although the stochastic fault injection (which is five lines of Python) in that explanation of the data does feel like cheating to me.
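And by "five lines of Python" I mean something like the sketch below, building on the weak-engine setup above. The function name and the 5% fault rate are made up; the point is just how little code it takes to make an engine-backed player occasionally emit invalid moves:

    import random
    import chess
    import chess.engine

    def move_with_faults(board, engine, p_fault=0.05):
        # Hypothetical fault injection: with small probability, fabricate a
        # move between two random squares with no legality check at all;
        # otherwise return the engine's move.
        if random.random() < p_fault:
            return chess.Move(random.choice(chess.SQUARES), random.choice(chess.SQUARES))
        return engine.play(board, chess.engine.Limit(time=0.1)).move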

What I do consider relevant is that, based on what I know of LLMs, intensively training one to emit chess tokens seems almost banal in terms of outcomes. Also, while I don't trust OpenAI's company culture much, I do think they're more interested in 'legitimately' weighting their products to pass benchmarks, or just in building stuff with LLMs if you prefer.

I actually think their product would benefit from more code that detects 'stuff normal programs should be doing' and uses it. There's been something of a trend toward that, which makes the whole chatbot more useful. But I don't think that's what happened with this one edition of GPT-3.5.

replies(1): >>42175648 #