I mean, playing a full game of Sudoku is beyond the capabilities of o3, Gemini 2.5 Pro, or any other public LLM. Of course their abilities are jagged, even if they have peaks elsewhere. But even where they are supposed to excel, I very rarely come across a fully correct technical response, even from these two most recent models.
If you ask one a math question beyond average middle-school level, the answer will have holes (mathematical errors or misleading statements), if not right away then within a few follow-up turns. And that’s without trying to fool it.
By contrast, in ten-plus years of using Wolfram Alpha I’ve found exactly one error (and that was with the help of o3-mini, funnily enough).
I’m still on the stochastic-parrots side: LLMs are a useful tool on some occasions, but no more than that.