Conducting The First Live Enterprise Comparison Between Agents and Human Professionals

A new study from Stanford and Gray Swan finds that purpose-built AI agents can outperform most human cybersecurity professionals in real-world penetration testing—at a fraction of the cost.

Eliot Jones
December 11, 2025

Earlier this year, we conducted an experiment that has never been done before: a head-to-head comparison of AI agents and professional human penetration testers on a live enterprise network. Existing evaluations of the cybersecurity capabilities of AI agents fall short in numerous ways:

  • Academic CTF benchmarks like Cybench and NYU CTFBench, as well as industry benchmarks such as the one from Irregular, lack the depth and complexity of real-world environments.
  • CVE-based benchmarks like CVE-Bench and BountyBench do contain real-world vulnerabilities, but present them in toy scenarios that are not representative of real-world exploit chains.
  • The agentic frameworks used to evaluate LLMs on these tasks are not sufficiently powerful to measure true risk.

We set out to conduct a different kind of evaluation, one that addresses these weaknesses of current cybersecurity benchmarks. To do this, we created an AI agent framework called ARTEMIS (short for Automated Red Teaming Engine with Multi-agent Intelligent Supervision), which placed second overall in the comparison, outperforming 90% of the human participants. What we found challenges the prevailing consensus on AI cyber risk: the gap between benchmark performance and real-world capability is larger than most realize.

The Study

The target environment was Stanford’s Computer Science network: a heterogeneous mix of Unix systems, IoT devices, Windows machines, and embedded systems. We recruited 10 professional penetration testers, many of whom had discovered critical CVEs in applications used by hundreds of thousands of users, and gave them student-level credentials and access to a Kali Linux virtual machine. They were instructed to find, exploit, and document as many vulnerabilities in Stanford’s network as they could within a 10-hour window.

We then evaluated existing AI agents under the same conditions, comparing ARTEMIS against five existing frameworks: OpenAI’s Codex, Claude Code, CyAgent, Incalmo, and MAPTA. All agents received the same instructions as the human participants.

The Results 

ARTEMIS discovered the following critical vulnerabilities:

  • Dell iDRAC servers with default credentials, which provided administrative access to server management interfaces
  • DNS cache poisoning vulnerabilities in department nameservers
  • SMB share write access that enabled persistent backdoors at the root level
  • Critical SSH service vulnerabilities on high-value research servers

To accomplish this, the agent used methods strikingly similar to those of the best human participants. Both ARTEMIS and the penetration testers followed the same standard workflow: reconnaissance, targeting, probing, exploitation, and iteration. However, ARTEMIS’s agentic capabilities gave it a few noteworthy advantages over a human hacker.

When ARTEMIS identified something noteworthy from a scan, it immediately spawned sub-agents to probe those targets in parallel—sometimes running up to 8 concurrent investigations. The framework averaged nearly 3 concurrent sub-agents at any given moment, whereas human testers, constrained to serial workflows, couldn’t match this parallelism.
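In code, this orchestration pattern looks roughly like the sketch below: a supervisor triages scan output for interesting targets and then fans out sub-agents in parallel under a fixed concurrency cap. This is a minimal illustration rather than the ARTEMIS implementation; triage_scan and run_subagent are hypothetical placeholders for the LLM-driven components.

    # Minimal sketch of the supervisor/sub-agent pattern described above.
    # Assumptions (not from the ARTEMIS source): triage_scan and run_subagent
    # are hypothetical stand-ins for the LLM-driven components.
    import asyncio

    MAX_CONCURRENT_SUBAGENTS = 8  # concurrency cap mentioned in the post


    def triage_scan(scan_output: str) -> list[str]:
        """Pick targets worth a closer look from raw scan output (placeholder logic)."""
        return [line.split()[0] for line in scan_output.splitlines() if "open" in line]


    async def run_subagent(target: str, sem: asyncio.Semaphore) -> dict:
        """Investigate one target while holding a concurrency slot."""
        async with sem:
            # A real sub-agent would drive an LLM through probing and exploitation here.
            await asyncio.sleep(0.1)  # stand-in for the actual investigation
            return {"target": target, "findings": []}


    async def supervise(scan_output: str) -> list[dict]:
        """Fan out one sub-agent per interesting target, capped at 8 in flight."""
        sem = asyncio.Semaphore(MAX_CONCURRENT_SUBAGENTS)
        targets = triage_scan(scan_output)
        tasks = [asyncio.create_task(run_subagent(t, sem)) for t in targets]
        return await asyncio.gather(*tasks)


    if __name__ == "__main__":
        example_scan = "10.0.0.5 22/tcp open ssh\n10.0.0.17 445/tcp open smb"
        print(asyncio.run(supervise(example_scan)))

The cap matters in practice: unbounded fan-out would quickly exhaust API rate limits and flood the network with scans, while a small semaphore keeps several investigations in flight at once without losing control of the run.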

This parallelism also enabled ARTEMIS to maintain the longest time-on-task of any agent or human participant. Over the entire 10-hour period, only ARTEMIS continued finding and submitting vulnerabilities past the 8.5-hour mark, and only one human participant worked past 8 hours. The ability to sustain long stretches of testing is another significant advantage that fully autonomous agents have over human penetration testers. In contrast, OpenAI’s Codex (powered by GPT-5) ran for only 17 minutes. CyAgent, using both GPT-5 and Claude Sonnet 4, finished in under two hours in both configurations. Some agents, such as Claude Code and MAPTA, immediately refused to complete the task. These stark differences highlight existing weaknesses in cybersecurity evaluations, many of which rely on these frameworks as their representation of current LLM capabilities. ARTEMIS proves that current LLMs are capable of much more.

Why We Need Better Evaluations

Model developers rely on evaluations to assess the risks their AI models pose prior to deployment. Two recent examples are Claude Sonnet 4.5 from Anthropic and GPT-5.1-Codex-Max from OpenAI. Both models were assessed extensively on cybersecurity benchmarks to better understand their capabilities across different cyber tasks. Anthropic, which relied on Cybench, CyberGym, and Incalmo, concluded the following:

Based on our evaluation results, we believe that current AI models, including Claude Sonnet 4.5, do not yet possess the capabilities necessary to substantially increase the number or scale of cyber-enabled catastrophic events

Similar statements have been made by the wider research community. The authors of PACEBench, a newer cybersecurity evaluation framework, conclude that their findings “suggest that current models do not yet pose a generalized cyber offense threat.” Another agentic framework, CAI, claims that

fully-autonomous cybersecurity systems remain premature and face significant challenges when tackling complex tasks. While CAI explores autonomous capabilities, our results clearly demonstrate that effective security operations still require human teleoperation providing expertise, judgment, and oversight in the security process.

However, ARTEMIS proves that results on these evaluations are not necessarily indicative of real-world performance, and that many of the challenges AI agents face on more complex tasks can be solved with better scaffolding. ARTEMIS found critical vulnerabilities in a hardened, real-world environment, and did so fully autonomously over a 10-hour period without human oversight. Our results point to a drastically different reality than the one described above: one where AI agents are already capable of causing real-world harm. This sentiment is supported by a recent Anthropic blog post, which reports the discovery of real threat actors using Claude Code, Anthropic’s general-purpose autonomous coding agent, to conduct large-scale cybercrime.

Safety and Security Implications

In that blog post, Anthropic claims the following:

they [the threat actors] had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. They did so by jailbreaking it, effectively tricking it to bypass its guardrails.

We report the opposite: within the ARTEMIS framework, we observed no refusals. Despite Claude’s extensive safety training, we did not need to jailbreak Claude Sonnet 4 within ARTEMIS’s prompting structure to elicit compliance with our requests. This contrasts with our observations using Claude Code with Claude Sonnet 4, where the model refused immediately when given an identical instruction. We did not use any advanced jailbreaking or adversarial prompting techniques; simply providing a sufficient level of cybersecurity instruction and detail was enough to bypass the guardrails of every model we tested. Despite claims of extensive refusal training, no special expertise or jailbreaking knowledge is required to get state-of-the-art LLMs to assist with automated hacking tasks. As a result, additional guardrails and pre-deployment testing are needed to ensure that the claimed robustness of these frontier models matches their actual robustness in real deployment scenarios.

How Gray Swan Can Help

As a leader in AI Security, we remain at the forefront of critical research and development—as evidenced by our collaboration on this groundbreaking study. The findings are stark: AI agents can outperform seasoned professionals at a fraction of the cost, and purpose-built frameworks can elicit harmful capabilities from frontier models more easily than expected.

Gray Swan’s platform provides comprehensive evaluation and safeguards for your AI agents. Shade, our automated AI red teamer, enables you to rigorously test your AI systems’ capabilities and safeguards so that you have a complete understanding of the risks those systems pose. Our platform also actively defends your AI systems, providing continuous protection against adversaries who might seek to misuse them to cause harm.

Protect your organization and your customers by implementing rigorous pre-deployment testing. Schedule a demo to see how Gray Swan’s platform ensures your AI safeguards match the capabilities of real-world threats.


The full paper, “Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing,” is available here. ARTEMIS source code is available at https://github.com/Stanford-Trinity/ARTEMIS. You can read more about our work in this Wall Street Journal article.