I've also invested some time in this space over the last several years. A group I'm in took the approach of custom-building agents for each CTF we entered. Our best result so far was an agent competing in an AI CTF against current top injection/jailbreak and leakage defense techniques: it autonomously completed 22 of the 40 challenges and at one point held 8th place out of 380 teams. It eventually plateaued and slipped to 12th by the end.
The tooling and models are maturing quickly, and there is definitely some value in autonomous security agents, both offensive and defensive. But it still takes a lot of work, knowledge (my group is all ML people), skill, and planning if you want to approach anything more than bug bashing.
This recent paper from Dreadnode discusses a benchmark for this sort of challenge: https://arxiv.org/abs/2506.14682