The Evaluation Frontier Labs Trust Before They Ship

Gray Swan’s research team develops the adversarial benchmarks, evaluation frameworks, and safety methodologies that frontier labs use to measure model risk before release. Not adapted from public datasets. Built from original research.

Request an Evaluation

Explore Our Research

The Full Evaluation Stack

Automated Adversarial Testing

Custom LLM attacker trained on diverse Arena attack strategies, adapted to your model's policies and surfaces
Robustness measurement: quantifies how much adversarial effort it takes to elicit out-of-policy behavior
Comparative scoring across checkpoints, safeguard configurations, and peer models
Hundreds of harm categories with systematic coverage and many variations per behavior

Talk to an Expert

Learn More About Shade

Crowdsourced Adversarial Intelligence

15,000+ red teamers generating continuous adversarial data
Quarterly flagship challenges plus custom challenges scoped to your priority risk areas
Peer benchmarking with comparative data and full access to results

Explore the Arena

Test Your Skills in the Arena

Priority Benchmarking

ART (Agent Red-Teaming): direct and indirect agentic harm including tool use, multi-turn attack chains, and instruction-hierarchy violations
IPI (Indirect Prompt Injection): resilience across tool use, coding agents, and browser-use agents
Cross-lab standard: increasingly adopted across frontier labs for agentic safety

Schedule Time with a Model Safety Expert

Private AI Red Teaming

Arena's top performers: hand-selected subject-matter experts, not a generic bench
Deep coverage across indirect prompt injection, multi-turn attack chains, tool use, and bespoke expansions
Citable deliverables: reproducible transcripts, severity judgments, and raw attempt data

Learn More About AI Red Teaming

Weight-Level Safety Training

Harder to fine-tune out: makes safety alignment significantly more robust post-release
Near-zero capability penalty: safety without sacrificing model performance
Circuit breaker research: a fundamentally different approach to alignment, built by Gray Swan's research team
Built for open-weights model builders

Let’s Talk About Cygnet

Specialized Domain Evaluation

High-risk and regulated domains: evaluation and datasets where standard benchmarks don't apply
Scoped to your model: tailored to your specific risk profile and regulatory requirements
Specialized coverage areas not addressed by off-the-shelf evaluation tools

Talk to a Model Security Expert

If Your Safety Claims Need to Survive Public Scrutiny

Frontier labs shipping foundation models

You need evaluation depth and independence that go beyond your internal red team, paired with findings rigorous enough to cite in your model card.

Model builders entering regulated or high-risk domains

You need third-party evaluation against specialized benchmarks with methodology documentation that satisfies regulators and auditors.

Safety and alignment teams

You need a continuous stream of adversarial findings from automated testing, human experts, and crowdsourced intelligence to inform guardrail design, RLHF, and post-training safety work.

Talk to our AI Frontier Safety Team

The Published Research Speaks for Itself

Circuit breakers, representation engineering, ART, IPI: Gray Swan's researchers wrote the papers and defined the field
Gray Swan Arena intelligence feeds every evaluation via 15,000+ red teamers from around the globe, so your model is tested against what's current
One partner for complete coverage: automated testing, human red teaming, crowdsourced intelligence, priority benchmarks, weight-level safety, and specialized domain coverage
More external evaluation citations than any other AI red teaming partner, with a methodology built to be cited
