The big step function here seems to be RL on tool calling.
Claude 3.7/3.5 are the only models that seem to be able to handle "pure agent" usecases well (agent in a loop, not in an agentic workflow scaffold[0]).
OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).
o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.
My hunch is that these new models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7
tl;dr - GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions" which then led GPT-3.5/GPT-4, and ultimately the success of ChatGPT. This new agent paradigm, requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.
[0]https://www.anthropic.com/engineering/building-effective-age...