Maybe it's that I do have PhD-level questions to ask them, and they've gotten much better at answering those.
But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model, and it works less well with later models.
Or perhaps they have a time-sensitive task and can't take advantage of the extended reasoning of modern LLMs, whose thinking step makes the feedback loop slower. Or maybe their code base is getting more complicated, so it's harder to reason about.
Or perhaps they're giving the LLMs a poorly defined task that older models simply made assumptions about, while newer models recognize the ambiguity and so find the space of solutions harder to navigate.
Since this is ultimately from a company doing AI scanning for security, I would think that last factor plays a role to some extent. Security is insanely hard, and the more you know about it, the harder it gets. Also, adversaries are bound to be using AI and growing in sophistication, which would lower efficacy (although you could tease this effect out by running older models against the newer threats).