> [T]here are ~basically~ no public benchmarks for security research... nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.
A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to a codebase's strategy and purpose. They therefore struggle to produce anything meaningful that depends on that understanding, like a security assessment or a system diagram.
[0] https://www.ilograph.com/blog/posts/diagrams-ai-can-and-cann...