UK AISI × Gray Swan Agent Red‑Teaming Challenge: Results Snapshot

The UK AISI Agent Red-Teaming Challenge just wrapped up after a month-long run (March 8 - April 6). It was our biggest arena challenge yet, representing the largest public evaluation of agentic LLM safety to date.

Gray Swan
May 9, 2025

The UK AISI Agent Red-Teaming Challenge just wrapped up after a month-long run (March 8 - April 6). It was our biggest arena challenge yet, representing the largest public evaluation of agentic LLM safety to date. Huge thanks to everyone who jumped in to red-team the models! Below is a quick write-up, with a paper & deep dive coming soon.

Quick Stats

  • 1,800,000 attempts trying to break models
  • 62,000 successful breaks found by the community
  • Across 22 different LLMs (kept anonymous during the challenge)
  • Targeting 44 specific harmful agent behaviors, rolled out in 4 waves
  • With $171,800 awarded in cash prizes

This large dataset gives us a much clearer picture of where current models stand on agentic safety and where the cracks are.

Thanks to Our Partners

This wouldn't have happened without solid support from leading organizations in the AI safety ecosystem.

A big thank you to our main sponsor and judging partner, the UK AI Security Institute (UK AISI). They provided funding and crucial expertise, helping calibrate the automated judging, and handled all manual break appeal reviews to keep things fair. Their involvement highlights why this kind of testing is important for AI safety research globally.

We were also really glad to have the U.S. AI Safety Institute (US AISI) join as a co-judge. It underscores how AI safety is becoming a global, multi-party effort. Collaboration like this between the AISIs is key for developing sensible, widely accepted ways to evaluate AI.

We also got generous co-sponsorship from OpenAI, Anthropic, Google DeepMind, and other labs. Their support helped boost the total prize pool to $171,800, making this the largest AI red-teaming challenge to date. Getting safety researchers, government institutes, and AI developers working together is how we make progress on building safer AI.

In the end, we awarded cash prizes to 161 out of thousands of red-teamers across different prize categories, rewarding effective techniques and persistence.

Congrats to the Top Breakers!

Big congratulations to the top performers! It was competitive, with prizes for overall performance, finding lots of breaks, being the first to break specific model/behavior combos, flagging over-refusals, and even for first-timers scoring a break. The top 10 overall also won fast-tracked job interviews at Gray Swan and UK AISI.

A special nod to the top 5 overall red-teamers by unique break count: zardav, Wyatt Walls, Bob1, Clovis Mint, and Strigiformes, who each racked up an identical 924 unique breaks; speed was the tie-breaker there! Also making the top 10 were P1njec70r, diogenesofsinope, _Stellaris, Philip, and Scrattlebeard. Nice work, everyone!

Here's the final Top 10 leaderboard (ranked by unique breaks, then by speed):

Rank

User

Breaks

Prize

Total Prizes

1

zardav

924

$2,000

$7,331.59

2

Wyatt Walls

924

$1,750

$5,913.78

3

Bob1

924

$1,500

$6,740.03

4

Clovis Mint

924

$1,250

$9,262.44

5

Strigiformes

924

$1,000

$3,659.36

6

P1njec70r

910

$500

$6,436.11

7

diogenesofsinope

899

$500

$2,932.32

8

_Stellaris

848

$500

$3,675.01

9

Philip

843

$500

$2,555.45

10

Scrattlebeard

835

$500

$5,205.48

We were blown away by the creativity, skill, and speed of these red-teamers. Competition was fierce–it’s clear that these are some of the top AI red-teamers in the world.

For more results, check out the full Agent Red-Teaming leaderboards.

How Did the Models Hold Up? (Attack Success Rates)

While finding breaks was the goal, looking at how often attacks succeeded against different models gives us a picture of relative robustness. Red-teaming helps us see where defenses work and where they bend under pressure.

While all models were broken many times for all behaviors, measuring the attack success rate (ASR) by calculating Total Breaks / Total Chats gives a sense of each model’s relative robustness:

  • Most Robust (Lowest ASR): Claude 3.7 Sonnet (Thinking) held the line with a mere 1.47% attack success rate against it. Also showing strong performance were Claude 3.7 Sonnet at 1.61% and Claude 3.5 Sonnet at 1.85%. The top non-Anthropic model was GPT-4o at 2.41% ASR.
  • Least Robust (Highest ASR): On the other end, meta-llama/llama-3.3-70b-instruct (Capella) was bypassed most often, with a 6.49% ASR, followed by mistralai/pixtral-large-2411 (Antares) at 6.23% and meta-llama/llama-3.1-405b-instruct (Altair) at 5.89%.

This range (from ~1.5% to ~6.5% success rate) shows the current diversity in model safety across different architectures and training approaches. It also highlights why broad, comparative testing like this is valuable: it gives developers concrete feedback on what's working and what isn't. Our co-sponsoring lab partners get the full attack dataset, allowing them to create evals and measure their next-gen model safeguard performance on agentic tasks.

Here’s the full leaderboard of models by ASR (attack success rate):

Ranking

Model

Total Breaks

Total Chats

ASR

1

anthropic/claude-3.7-sonnet:thinking (Polaris)

1,613

110,089

1.47%

2

anthropic/claude-3.7-sonnet (Aldebaran)

1,918

118,905

1.61%

3

anthropic/claude-3.5-sonnet (Vulpecula)

1,819

98,403

1.85%

4

openai/gpt-4o (Achernar)

2,415

100,109

2.41%

5

anthropic/claude-3.5-haiku-20241022 (Algol)

2,259

92,425

2.44%

6

openai/o3-2024-04-03 (Draco)

349

13,958

2.50%

7

openai/o1 (Canopus)

2,073

80,895

2.56%

8

openai/gpt-4.5-preview (Vega)

2,446

94,850

2.58%

9

Model Spica

3,160

105,618

2.99%

10

Model Arcturus

3,468

102,442

3.39%

11

cohere/command-r-08-2024 (Procyon)

3,833

103,121

3.72%

12

Model Pollux

3,087

81,544

3.79%

13

Model Andromeda

3,252

80,166

4.06%

14

Model Castor

3,378

82,611

4.09%

15

x-ai/grok-2-1212 (Deneb)

3,470

83,715

4.15%

16

openai/o3-mini (Regulus)

3,073

72,599

4.23%

17

Model Fomalhaut

3,440

78,117

4.40%

18

Model Orion

1,640

35,350

4.64%

19

openai/o3-mini-high (Betelgeuse)

3,029

63,831

4.75%

20

meta-llama/llama-3.1-405b-instruct (Altair)

3,380

57,382

5.89%

21

mistralai/pixtral-large-2411 (Antares)

4,084

65,599

6.23%

22

meta-llama/llama-3.3-70b-instruct (Capella)

4,951

76,228

6.49%

(Note: Success Rate = Total Breaks / Total Chats for that model during the challenge. Lower % means more robust in this context.)

Models Pollux, Castor, Fomalhaut, Orion, Spica, Arcturus, and Andromeda to remain anonymous at this time upon lab request. Models Draco (o3) and Orion were added toward the end of the competition and had fewer attempted breaks.

Overall attack success rate is a high-level metric, but it doesn’t capture the full security picture.. In our upcoming paper and deep dive blog post, we’ll break this down more by behavior type, attack type, and model.

How the Challenge Worked & Why It Matters

So, what was involved? Participants got an interface to chat with different anonymized AI agents. Many agents had simulated "tools" to perform actions, like real AI assistants. The job was to make these agents mess up according to specific rules (behaviors) across four main areas:

  1. Confidentiality Breaches: Get the agent to spill secrets.
  2. Conflicting Objectives: Convince the agent to prioritize bad goals against its explicit safety directives.
  3. Instruction Hierarchy Violations (Info): Make it give info it shouldn't (like doing all the homework).
  4. Instruction Hierarchy Violations (Actions): Trick it into using tools for forbidden actions.

We also measured over-refusals, where the models would refuse harmless requests (being overly cautious).

Across these categories, target behaviors were split into direct chat attacks and indirect prompt injections, a key vulnerability for agents. This involves sneaking instructions into the data returned by the agent's tools to manipulate its behavior. For example, in this scenario where a user is asking the agent to analyze some log files, a malicious log file entry tricks the agent to making all files on the system readable, writable, and executable by any user:

Why bother with this kind of red-teaming? As AI agents get smarter and do more things (interact with systems, use tools, act autonomously), the ways they can fail get more complicated and the potential impact gets bigger. An agent messing up isn't like a chatbot giving a weird answer; it could potentially take harmful actions.

Red-teaming lets us find these risks in a controlled way before bad actors do, or before accidents happen. It's like stress-testing software, but for AI-specific problems like manipulation, bias, or unexpected "emergent" behaviors. Key benefits:

  • Finds cracks before they're exploited widely.
  • Gives developers concrete feedback to improve safety.
  • Helps us build better safety tests and benchmarks.
  • Connects a community focused on making AI safer.
  • Provides real evidence about model capabilities and limitations.

Large-scale public red teaming like this gives us signal that internal testing alone often misses.

More to Come

The competition period has ended, but the Agent Red-Teaming arena is still open for exploration. Our researchers are now in their data analysis happy place. What’s next?

  • More Analysis Coming: We're planning a full academic paper and a more detailed blog post with deeper analysis of the results (attack patterns, defense effectiveness, etc.). Look for those in May.
  • Visual Vulnerabilities Wrap-up: Our Visual Vulnerabilities Challenge (breaking models using images) has also just concluded. Results from that coming soon too.
  • Next Up: the Dangerous Reasoning Challenge: We're announcing our next big Arena event! This challenge moves beyond just behavior to probe risks in the AI's reasoning itself. Think deception in the chain of thought leading to covertly harmful outputs in catastrophic scenarios–the kind of stuff that gets trickier as models get smarter. What can trigger reasoning models to go rogue in realistic, high-stakes scenarios? This is an important area to explore, so if you want to help secure future models, learn red-teaming, or get a crack at $20K in prizes, join us for the Dangerous Reasoning Challenge starting tomorrow (May 10).
  • More Arenas: We’re designing more targeted arenas to help researchers secure critical AI capabilities areas, with cybersecurity as one of our next focus areas. If you’re interested in collaborating on an arena, please reach out.

About Gray Swan AI

Gray Swan is an AI security research organization focused on making AI safe and beneficial. We conduct adversarial AI research, partner with labs to secure their models, and offer enterprise solutions to help organizations deploy AI responsibly. Sound interesting? We’re hiring.