The UK AISI Agent Red-Teaming Challenge just wrapped up after a month-long run (March 8 - April 6). It was our biggest arena challenge yet, representing the largest public evaluation of agentic LLM safety to date.
The UK AISI Agent Red-Teaming Challenge just wrapped up after a month-long run (March 8 - April 6). It was our biggest arena challenge yet, representing the largest public evaluation of agentic LLM safety to date. Huge thanks to everyone who jumped in to red-team the models! Below is a quick write-up, with a paper & deep dive coming soon.
This large dataset gives us a much clearer picture of where current models stand on agentic safety and where the cracks are.
This wouldn't have happened without solid support from leading organizations in the AI safety ecosystem.
A big thank you to our main sponsor and judging partner, the UK AI Security Institute (UK AISI). They provided funding and crucial expertise, helping calibrate the automated judging, and handled all manual break appeal reviews to keep things fair. Their involvement highlights why this kind of testing is important for AI safety research globally.
We were also really glad to have the U.S. AI Safety Institute (US AISI) join as a co-judge. It underscores how AI safety is becoming a global, multi-party effort. Collaboration like this between the AISIs is key for developing sensible, widely accepted ways to evaluate AI.
We also got generous co-sponsorship from OpenAI, Anthropic, Google DeepMind, and other labs. Their support helped boost the total prize pool to $171,800, making this the largest AI red-teaming challenge to date. Getting safety researchers, government institutes, and AI developers working together is how we make progress on building safer AI.
In the end, we awarded cash prizes to 161 out of thousands of red-teamers across different prize categories, rewarding effective techniques and persistence.
Big congratulations to the top performers! It was competitive, with prizes for overall performance, finding lots of breaks, being the first to break specific model/behavior combos, flagging over-refusals, and even for first-timers scoring a break. The top 10 overall also won fast-tracked job interviews at Gray Swan and UK AISI.
A special nod to the top 5 overall red-teamers by unique break count: zardav, Wyatt Walls, Bob1, Clovis Mint, and Strigiformes, who each racked up an identical 924 unique breaks; speed was the tie-breaker there! Also making the top 10 were P1njec70r, diogenesofsinope, _Stellaris, Philip, and Scrattlebeard. Nice work, everyone!
Here's the final Top 10 leaderboard (ranked by unique breaks, then by speed):
We were blown away by the creativity, skill, and speed of these red-teamers. Competition was fierce–it’s clear that these are some of the top AI red-teamers in the world.
For more results, check out the full Agent Red-Teaming leaderboards.
While finding breaks was the goal, looking at how often attacks succeeded against different models gives us a picture of relative robustness. Red-teaming helps us see where defenses work and where they bend under pressure.
While all models were broken many times for all behaviors, measuring the attack success rate (ASR) by calculating Total Breaks / Total Chats gives a sense of each model’s relative robustness:
This range (from ~1.5% to ~6.5% success rate) shows the current diversity in model safety across different architectures and training approaches. It also highlights why broad, comparative testing like this is valuable: it gives developers concrete feedback on what's working and what isn't. Our co-sponsoring lab partners get the full attack dataset, allowing them to create evals and measure their next-gen model safeguard performance on agentic tasks.
Here’s the full leaderboard of models by ASR (attack success rate):
(Note: Success Rate = Total Breaks / Total Chats for that model during the challenge. Lower % means more robust in this context.)
Models Pollux, Castor, Fomalhaut, Orion, Spica, Arcturus, and Andromeda to remain anonymous at this time upon lab request. Models Draco (o3) and Orion were added toward the end of the competition and had fewer attempted breaks.
Overall attack success rate is a high-level metric, but it doesn’t capture the full security picture.. In our upcoming paper and deep dive blog post, we’ll break this down more by behavior type, attack type, and model.
So, what was involved? Participants got an interface to chat with different anonymized AI agents. Many agents had simulated "tools" to perform actions, like real AI assistants. The job was to make these agents mess up according to specific rules (behaviors) across four main areas:
We also measured over-refusals, where the models would refuse harmless requests (being overly cautious).
Across these categories, target behaviors were split into direct chat attacks and indirect prompt injections, a key vulnerability for agents. This involves sneaking instructions into the data returned by the agent's tools to manipulate its behavior. For example, in this scenario where a user is asking the agent to analyze some log files, a malicious log file entry tricks the agent to making all files on the system readable, writable, and executable by any user:
Why bother with this kind of red-teaming? As AI agents get smarter and do more things (interact with systems, use tools, act autonomously), the ways they can fail get more complicated and the potential impact gets bigger. An agent messing up isn't like a chatbot giving a weird answer; it could potentially take harmful actions.
Red-teaming lets us find these risks in a controlled way before bad actors do, or before accidents happen. It's like stress-testing software, but for AI-specific problems like manipulation, bias, or unexpected "emergent" behaviors. Key benefits:
Large-scale public red teaming like this gives us signal that internal testing alone often misses.
The competition period has ended, but the Agent Red-Teaming arena is still open for exploration. Our researchers are now in their data analysis happy place. What’s next?
Gray Swan is an AI security research organization focused on making AI safe and beneficial. We conduct adversarial AI research, partner with labs to secure their models, and offer enterprise solutions to help organizations deploy AI responsibly. Sound interesting? We’re hiring.