I read this as: this outlier version is connecting to an engine, and that engine happens to be parameterized for a not particularly deep search depth.
If it's an exercise in integration, they don't need to waste cycles on making the engine play brilliantly - for validation it's enough that the integrated result is noticeably less bad than the LLM alone rambling away while trying to sound like a chess expert.
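Just to make concrete how cheap "not particularly deep" could be - a minimal sketch using python-chess and a local Stockfish binary (both my assumptions, nothing from the post): capping the search is literally one parameter.

    import chess
    import chess.engine

    # Hypothetical: a shallow, cheap engine call - enough to look far better
    # than an LLM's unaided move suggestions, nowhere near full engine strength.
    engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")
    board = chess.Board()
    result = engine.play(board, chess.engine.Limit(depth=4))
    print(result.move)
    engine.quit()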
If it's fine up to a certain depth, it's much more likely that it was trained on an opening book imo.
What nobody has bothered to try and explain with this crazy theory is why OpenAI would care to do this at enormous expense to themselves.
Yeah, that thought crossed my mind as well. I dismissed it on the assumption that the measurements in the blog post weren't done from openings but from later-stage game states, but I didn't verify that assumption, so I might be wrong.
As for the insignificance of engine cycles vs LLM cycles, sure. But if it's an integration experiment, they might buy the chess API from some external service with a big disconnect between price and cycle cost, or host one themselves and simply not bother with any scaling mechanism, since calling it with a low depth parameter would already make it good enough for the validation to detect.
And the last uncertainty, where I'm much further out of my knowledge: we don't know how many calls to the engine a single prompt might cause. Who knows how many cycles of "inner dialogue" refinement might run for a single prompt, and how often the chess engine might get consulted for prompts that aren't really related to chess before the guessing machine finally rejects that possibility. The number of chess engine calls might be massive, big enough to make cycles per call a meaningful factor again.
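To illustrate the multiplication I mean (purely hypothetical - I have no idea how their routing works, and everything in this stub is made up): if every prompt can trigger several refinement passes and each pass may consult the engine, the calls per prompt add up.

    import chess
    import chess.engine

    # Made-up sketch: one prompt, several "inner dialogue" passes, each pass
    # possibly consulting the engine before the chess hypothesis is dropped.
    def engine_calls_for_prompt(prompt, engine, refinement_passes=5):
        calls = 0
        for _ in range(refinement_passes):
            maybe_chess = "chess" in prompt.lower() or "e4" in prompt
            if not maybe_chess:
                break
            board = chess.Board()  # a real pipeline would reconstruct the position
            engine.play(board, chess.engine.Limit(depth=4))
            calls += 1
        return calls

    engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")
    print(engine_calls_for_prompt("What's the best reply to 1. e4?", engine))
    engine.quit()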