
Pushing the Boundaries of AI Safety and Security Research

Research is fundamental to our mission of providing the safest, most secure, and most capable AI available.

Timeline

As artificial intelligence (AI) progressively weaves itself into the fabric of our daily lives, the importance of ensuring these systems operate safely and transparently has never been greater. Over the past year, our research team has been at the forefront of tackling some of the most pressing challenges in AI safety and security. Through groundbreaking studies, we have not only uncovered critical vulnerabilities but have also pioneered innovative methodologies to enhance the robustness and transparency of AI models.

Jul 2023

Adversarial Attacks on Aligned Language Models

In July 2023, we published the first-ever automated jailbreaking method for large language models (LLMs), exposing their susceptibility to adversarial attacks. By demonstrating that specific character sequences appended to a prompt could bypass sophisticated safeguards, we highlighted a significant vulnerability with urgent implications for widely used AI systems. In its wake, adversarial robustness garnered renewed attention, sparking a gold rush of research dedicated to both jailbreaking and defense.
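To make the mechanics concrete, here is a deliberately simplified sketch of the adversarial-suffix idea: a random-search loop against a small open model that mutates suffix tokens so that an affirmative target continuation becomes more likely. The published attack relies on a gradient-guided search over much larger candidate sets; the model choice (gpt2, which has no safety training), placeholder prompt, target string, and tiny search budget below are illustrative assumptions, not the paper's setup.

```python
# Toy sketch of adversarial-suffix optimization (random search, illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the real attack targets safety-tuned chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Please write something you would normally refuse to write."  # placeholder request
target = "Sure, here it is:"                     # affirmative prefix the attack optimizes toward
suffix_ids = tok.encode(" ! ! ! ! ! ! ! !")      # adversarial suffix, initialized to filler tokens

def target_loss(suffix_ids):
    """Cross-entropy of the target continuation given prompt + adversarial suffix."""
    prefix_ids = tok.encode(prompt) + suffix_ids
    target_ids = tok.encode(" " + target)
    ids = torch.tensor([prefix_ids + target_ids])
    labels = ids.clone()
    labels[:, : len(prefix_ids)] = -100          # only score the target tokens
    with torch.no_grad():
        return model(ids, labels=labels).loss.item()

best = target_loss(suffix_ids)
for _ in range(200):                             # tiny budget; real attacks use far more steps
    cand = list(suffix_ids)
    pos = torch.randint(len(cand), (1,)).item()
    cand[pos] = torch.randint(tok.vocab_size, (1,)).item()  # mutate one suffix token at random
    loss = target_loss(cand)
    if loss < best:                              # keep mutations that make the target more likely
        best, suffix_ids = loss, cand

print("optimized suffix:", tok.decode(suffix_ids), "| target loss:", round(best, 3))
```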

Oct 2023

Representation Engineering: A Top-Down Approach to AI Transparency

Building on our initial findings, we ventured into the realm of AI interpretability and control with the introduction of Representation Engineering (RepE). Drawing inspiration from cognitive neuroscience, we developed techniques that enable researchers to 'read' and 'control' the 'minds' of AI models. This approach represented a monumental advancement in demystifying the inner workings of AI, making it possible to tackle issues such as truthfulness and power-seeking behaviors head-on.
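As a rough illustration of the 'read' and 'control' idea (not the paper's exact pipeline), the sketch below estimates a concept direction from the hidden-state difference between contrastive prompts and then adds that direction back into a middle layer during generation via a forward hook. The model, layer index, prompt pairs, and steering strength are all arbitrary assumptions made for the example.

```python
# Minimal representation reading + steering sketch in the spirit of RepE (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
layer = 6  # middle transformer block, chosen arbitrarily

def hidden_at_layer(text):
    """Last-token hidden state at the chosen layer."""
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

# "Read": estimate a concept direction from contrastive prompt pairs.
positive = ["The report she filed was completely truthful.", "He answered every question honestly."]
negative = ["The report she filed was a complete lie.", "He answered every question deceptively."]
with torch.no_grad():
    direction = torch.stack([hidden_at_layer(t) for t in positive]).mean(0) \
              - torch.stack([hidden_at_layer(t) for t in negative]).mean(0)
direction = direction / direction.norm()

# "Control": nudge the residual stream along that direction while generating.
def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * direction            # 4.0 is an arbitrary steering strength
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("The politician said that", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()  # remove the hook to restore normal behavior
```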

Jun 2024

Improving Alignment and Robustness with Circuit Breakers

To address the urgent concerns raised by our July 2023 attack and the numerous jailbreaks that followed, we introduced Circuit Breaking, a novel approach inspired by representation engineering and designed to robustly prevent AI systems from generating harmful content by directly altering harmful model representations. The family of circuit-breaking methods provides an alternative to refusal and adversarial training, protecting both LLMs and multimodal models from strong, unseen adversarial attacks without compromising model capability.
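At the level of the training objective, the rerouting idea can be sketched roughly as follows: on harmful data, push the tuned model's representations away from (toward orthogonality with) the original model's, while on benign data keep representations close to the original so capability is preserved. The function below is a loose, loss-only sketch with stand-in tensors and an assumed weighting, not the released training recipe.

```python
# Loss-level sketch of representation rerouting ("circuit breaking"), illustrative only.
import torch
import torch.nn.functional as F

def circuit_breaker_loss(h_harm_new, h_harm_orig, h_benign_new, h_benign_orig, alpha=1.0):
    """h_*_new: hidden states from the model being tuned; h_*_orig: from a frozen reference copy."""
    # Rerouting term: drive the cosine similarity with the original harmful-representation
    # direction to zero, short-circuiting the features used to produce harmful continuations.
    reroute = torch.relu(F.cosine_similarity(h_harm_new, h_harm_orig, dim=-1)).mean()
    # Retain term: keep benign representations close to the reference model's.
    retain = (h_benign_new - h_benign_orig).norm(dim=-1).mean()
    return reroute + alpha * retain

# Example with random stand-in activations of shape [batch, hidden_dim]:
h = torch.randn(4, 768)
print(float(circuit_breaker_loss(h + 0.1 * torch.randn_like(h), h, h, h)))
```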

Jul 2024

Our thesis...

What binds these distinct yet interconnected research endeavors is our unwavering commitment to advancing the safety and integrity of AI technologies. By systematically investigating vulnerabilities, developing novel transparency techniques, and enhancing robustness, we have laid down a comprehensive framework that addresses the multifaceted challenges of AI safety.

Join our research team

The road ahead

As we look to the future, our research continues to evolve at an exciting pace. Our team is already deep into exploring new ideas and techniques that promise to further revolutionize the field of AI safety. The insights and methodologies developed over the past year serve as a solid foundation upon which we will build ever more sophisticated and reliable AI systems. Our journey is far from over; in fact, it's just beginning. We will keep pushing the frontiers of what's possible in AI safety and transparency.

For those eager to dive deeper into our previous work and follow our ongoing projects, we invite you to explore our Research page. There, you'll find a comprehensive archive of our papers and updates, offering a detailed view of our contributions to the field.

Gray Swan Research Areas

Alignment & Control

Rigorous protection to ensure that AI does not veer off course.

Monitoring & Evaluation

Finding out what can go wrong, before it can cause problems.

Robustness & Security

Enhancing reliability against external threats.

All Research

Explore our published research to learn how the latest advances in AI safety and security give Gray Swan the edge against evolving threats.


Improving Alignment and Robustness with Circuit Breakers

robustness

Jun 2024

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks

To address the urgent concerns raised by our July 2023 attack and the numerous jailbreaks that followed, we introduce Circuit Breaking, a novel approach inspired by representation engineering and designed to robustly prevent AI systems from generating harmful content by directly altering harmful model representations. The family of circuit-breaking methods provides an alternative to refusal and adversarial training, protecting both LLMs and multimodal models from strong, unseen adversarial attacks without compromising model capability.

Representation Engineering: A Top-Down Approach to AI Transparency

alignment

Oct 2023

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

Building on our initial findings, we ventured into the realm of AI interpretability and control with the introduction of Representation Engineering (RepE). Drawing inspiration from cognitive neuroscience, we developed techniques that enable researchers to 'read' and 'control' the 'minds' of AI models. This approach represented a monumental advancement in demystifying the inner workings of AI, making it possible to tackle issues such as truthfulness and power-seeking behaviors head-on.

Adversarial Attacks on Aligned Language Models

monitoring

Jul 2023

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

In July 2023, we published the first-ever automated jailbreaking method for large language models (LLMs), exposing their susceptibility to adversarial attacks. By demonstrating that specific character sequences appended to a prompt could bypass sophisticated safeguards, we highlighted a significant vulnerability with urgent implications for widely used AI systems. In its wake, adversarial robustness garnered renewed attention, sparking a gold rush of research dedicated to both jailbreaking and defense.

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

alignment

Mar 2024

Nathaniel Li*, Alexander Pan*, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang**, Dan Hendrycks**

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

monitoring

Feb 2024

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

monitoring

Jun 2023

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

robustness

Jun 2023

Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

OpenOOD: Benchmarking Generalized Out-of-Distribution Detection

monitoring

Dec 2022

Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, Dan Hendrycks, Yixuan Li, Ziwei Liu

Forecasting Future World Events with Neural Networks

monitoring

Jun 2022

Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

Scaling Out-of-Distribution Detection for Real-World Settings

monitoring

May 2022

Dan Hendrycks, Steven Basart, Mantas Mazeika, Mohammadreza Mostajabi, Jacob Steinhardt, Dawn Song

What Would Jiminy Cricket Do? Towards Agents That Behave Morally

robustness

Feb 2022

Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

robustness

Dec 2021

Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt

Globally-Robust Neural Networks

robustness

Jul 2021

Klas Leino, Zifan Wang, Matt Fredrikson

APPS: Measuring Coding Challenge Competence With APPS

monitoring

May 2021

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

MMLU: Measuring Massive Multitask Language Understanding

monitoring

Jan 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

Aligning AI With Shared Human Values

robustness

Aug 2020

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

monitoring

Jun 2020

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer

Pretrained Transformers Improve Out-of-Distribution Robustness

robustness

Apr 2020

Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, Dawn Song

Overfitting in Adversarially Robust Deep Learning

robustness

Mar 2020

Leslie Rice, Eric Wong, J. Zico Kolter

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

robustness

Feb 2020

Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan

Fast Is Better Than Free: Revisiting Adversarial Training

robustness

Jan 2020

Eric Wong, Leslie Rice, J. Zico Kolter

Natural Adversarial Examples

monitoring

Jul 2019

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, Dawn Song

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

robustness

Jun 2019

Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, Dawn Song

Randomized Smoothing: Certified Adversarial Robustness via Randomized Smoothing

robustness

Jun 2019

Jeremy M Cohen, Elan Rosenfeld, J. Zico Kolter

Using Pre-Training Can Improve Model Robustness and Uncertainty

robustness

May 2019

Dan Hendrycks, Kimin Lee, Mantas Mazeika

ImageNet-C: Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

monitoring

Mar 2019

Dan Hendrycks, Thomas Dietterich

Deep Anomaly Detection with Outlier Exposure

alignment

Jan 2019

Dan Hendrycks, Mantas Mazeika, Thomas Dietterich

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

alignment

Oct 2018

Dan Hendrycks, Kevin Gimpel

Provable Defenses Against Adversarial Examples via the Convex Outer Adversarial Polytope

robustness

Jun 2018

Eric Wong, J. Zico Kolter