Current cybersecurity benchmarks fail to capture the real-world capabilities of AI agents, leading to significant underestimation of cyber risk. We conducted the first head-to-head comparison of AI agents and professional penetration testers on a live enterprise network to measure the true performance gap. Our findings reveal that purpose-built agents can outperform 90% of human professionals while operating continuously at a fraction of the cost, exposing critical flaws in how the industry evaluates AI capabilities and safety guardrails.
You can find the research at the link below.
Feel free to contact Gray Swan with any questions or comments.