1/ OpenAI's technical staff were using Claude Code via the API (not the Max plans).
2/ Anthropic's spokesperson says API access for benchmarking and evals will remain available to OpenAI.
3/ OpenAI said it was using the APIs for benchmarking.
I guess model benchmarking is fine, but tool benchmarking is not. It looks like OpenAI was trying to see whether their product works better than Claude Code (each running its own proprietary models) on certain benchmarks, and that is what Anthropic revoked access over.

How they caught it is the more remarkable part. It's one thing to use Sonnet 4 to solve a problem on LiveBench; it's slightly different to do it via the Claude Code harness, for which Anthropic never published any benchmark results themselves. Not saying this is the right stance, but it seems to be the stance.
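On the "how they caught it" point, one plausible mechanism is request metadata: a CLI harness typically identifies itself in its API calls, so a provider could distinguish harness-driven traffic from raw model calls in its logs. Here's a minimal sketch of that idea, assuming log records expose a `user_agent` field and that the harness sends an identifiable client string; both the marker strings and the field names are illustrative assumptions, not confirmed details of Anthropic's actual detection.

```python
# Hypothetical sketch: flag accounts whose API traffic looks like
# harness usage rather than plain model benchmarking. The marker
# strings and log schema below are assumptions for illustration,
# not confirmed details of any provider's pipeline.

from collections import Counter

HARNESS_MARKERS = ("claude-cli", "claude-code")  # assumed identifiers

def flag_harness_accounts(log_records, threshold=100):
    """Return accounts with more than `threshold` harness-like requests."""
    counts = Counter()
    for record in log_records:
        ua = record.get("user_agent", "").lower()
        if any(marker in ua for marker in HARNESS_MARKERS):
            counts[record["account_id"]] += 1
    return {acct: n for acct, n in counts.items() if n > threshold}

# Example: a few raw SDK calls vs. heavy harness-tagged traffic.
logs = (
    [{"account_id": "acct_A", "user_agent": "python-sdk/0.30"}] * 2
    + [{"account_id": "acct_B", "user_agent": "claude-cli/1.0"}] * 150
)
print(flag_harness_accounts(logs))  # {'acct_B': 150}
```

If something like this is the mechanism, it would explain why API-level model benchmarking is tolerated while harness-level usage stands out: the two produce visibly different traffic signatures.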