Something weird is happening with LLMs and chess

(dynomight.substack.com)

Show context

PaulHoule ◴[14 Nov 24 21:56 UTC] No.42141647[source]▶

>>42138289 (OP) #

Maybe that one which plays chess well is calling out to a real chess engine.

replies(6): >>42141726 #>>42141959 #>>42142323 #>>42142342 #>>42143067 #>>42143188 #

1. aithrowawaycomm ◴[14 Nov 24 23:18 UTC] No.42142323[source]▶

>>42141647 #

The author thinks this is unlikely because it only has an ~1800 ELO. But OpenAI is shady as hell, and I could absolutely see the following purely hypothetical scenario:

- In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.

- A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.

- Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.

- Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!

- Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPt-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.

- Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.

I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."

replies(3): >>42142488 #>>42143261 #>>42145724 #

2. gardenhedge ◴[14 Nov 24 23:40 UTC] No.42142488[source]▶

>>42142323 (TP) #

Not that convoluted really

replies(1): >>42142722 #

3. refulgentis ◴[15 Nov 24 00:15 UTC] No.42142722[source]▶

>>42142488 #

It's pretty convoluted, requires a ton of steps, mind-reading, and odd sequencing.*

If you share every prior, and aren't particularly concerned with being disciplined in treating conversation as proposing a logical argument (I'm not myself, people find it offputting), it probably wouldn't seem at all convoluted.

* layer chess into gpt-3.5-instruct only, but not chatgpt, not GPT-4, to defeat the naysayers when GPT-4 comes out? shrugs if the issues with that are unclear, I can lay it out more

** fwiw, at the time, pre-chatgpt, before the hype, there wasn't a huge focus on chess, nor a ton of naysayers to defeat. it would have been bizarre to put this much energy into it, modulo the scatter-brained thinking in *

replies(1): >>42146200 #

4. jmount ◴[15 Nov 24 01:54 UTC] No.42143261[source]▶

>>42142323 (TP) #

Very good scenario. One variation: some researcher or division in OpenAI performs all of the above steps to get a raise. The whole field is predicated on rewarding the appearance of ability.

5. tedsanders ◴[15 Nov 24 10:54 UTC] No.42145724[source]▶

>>42142323 (TP) #

Eh, OpenAI really isn't as shady as hell, from what I've seen on the inside for 3 years. Rubik's cube hand was before me, but in my time here I haven't seen anything I'd call shady (though obviously the non-disparagement clauses were a misstep that's now been fixed). Most people are genuinely trying to build cool things and do right by our customers. I've never seen anyone try to cheat on evals or cheat customers, and we take our commitments on data privacy seriously.

I was one of the first people to play chess against the base GPT-4 model, and it blew my mind by how well it played. What many people don't realize is that chess performance is extremely sensitive to prompting. The reason gpt-3.5-turbo-instruct does so well is that it can be prompted to complete PGNs. All the other models use the chat format. This explains pretty much everything in the blog post. If you fine-tune a chat model, you can pretty easily recover the performance seen in 3.5-turbo-instruct.

There's nothing shady going on, I promise.

6. gardenhedge ◴[15 Nov 24 12:14 UTC] No.42146200{3}[source]▶

>>42142722 #

It's not that many steps. I'm sure we've all seen our sales teams selling features that aren't in the application or exaggerating features before they're fully complete.

To be clear, I'm not saying that the theory is true but just that I could belive something like that could happen.

↑