Years of cutting-edge research power Gray Swan’s most advanced protection for your AI systems.
Rigorous protection to keep AI from veering off course.
Finding out what can go wrong before it causes problems.
Enhancing reliability against external threats.
Explore our published research to learn how the latest advances in AI safety and security give Gray Swan the edge against evolving threats.
Current cybersecurity benchmarks fail to capture the real-world capabilities of AI agents, leading to significant underestimation of cyber risk. We conducted the first head-to-head comparison of AI agents and professional penetration testers on a live enterprise network to measure true performance gaps. Our findings reveal that purpose-built agents can outperform 90% of human professionals while operating continuously at a fraction of the cost—exposing critical flaws in how the industry evaluates AI capabilities and safety guardrails.
The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning.
The emergence of vision-language-action models (VLAs) for end-to-end control is reshaping the field of robotics by enabling the fusion of multimodal sensory inputs at the billion-parameter scale. The capabilities of VLAs stem primarily from their architectures, which are often based on frontier large language models (LLMs). However, LLMs are known to be susceptible to adversarial misuse, and given the significant physical risks inherent to robotics, questions remain regarding the extent to which VLAs inherit these vulnerabilities.
Recent research on jailbreak attacks has focused almost exclusively on settings where LLMs act as simple chatbots. However, LLMs are now increasingly used in agentic workflows, i.e., equipped with external tools and potentially taking many steps to fulfill a user’s request. To address potential safety and alignment concerns arising from LLM agents, we introduce AgentHarm, a new benchmark for measuring the harmfulness of LLM agents.
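For readers unfamiliar with agentic workflows, the sketch below shows the kind of multi-step, tool-equipped loop such a benchmark targets. The tool names, model interface, and grading notion here are placeholders for illustration only, not AgentHarm's actual tasks or API.

```python
from typing import Callable

# Hypothetical tools an agent might be given; a real harness (and AgentHarm's
# task suite) defines its own, far richer tool sets.
TOOLS = {
    "search_web": lambda query: f"results for {query!r}",
    "send_email": lambda to, body: f"sent to {to}",
}

def run_agent(llm: Callable[[str], dict], task: str, max_steps: int = 5) -> list:
    """Drive an LLM through a multi-step, tool-using workflow.

    llm: returns either {"tool": name, "args": {...}} or {"final": text}.
    The resulting transcript is what a harmfulness grader then scores:
    did the agent refuse, partially comply, or complete the harmful task?
    """
    transcript, context = [], task
    for _ in range(max_steps):
        action = llm(context)
        if "final" in action:
            transcript.append(("final", action["final"]))
            break
        result = TOOLS[action["tool"]](**action["args"])
        transcript.append((action["tool"], result))
        context = f"{context}\nObservation: {result}"
    return transcript

# A stub "model" that refuses immediately, for illustration.
print(run_agent(lambda ctx: {"final": "I can't help with that."}, "task text"))
```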
To address the urgent concerns raised by our attack from last July and the numerous jailbreaks that came after, we introduce Circuit Breaking, a novel approach inspired by representation engineering, designed to robustly prevent AI systems from generating harmful content by directly altering harmful model representations. The family of circuit-breaking methods provides an alternative to refusal and adversarial training, protecting both LLMs and multimodal models from strong, unseen adversarial attacks without compromising model capability.
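To give a flavor of what "directly altering harmful model representations" means in practice, here is a minimal sketch of a rerouting-style training objective, with random tensors standing in for model activations. The function name, loss weighting, and loss terms are illustrative assumptions, not the published training recipe.

```python
import torch
import torch.nn.functional as F

def rerouting_loss(harmful_reps, harmful_reps_orig, retain_reps, retain_reps_orig, alpha=1.0):
    """Toy circuit-breaker-style objective.

    - On harmful inputs, push the fine-tuned model's hidden states away from
      the directions the original (frozen) model used, so the harmful
      "circuit" no longer produces coherent content.
    - On benign (retain) inputs, keep hidden states close to the original
      model's so general capability is preserved.
    """
    # Penalize any remaining alignment with the original harmful representations.
    cos = F.cosine_similarity(harmful_reps, harmful_reps_orig, dim=-1)
    reroute = torch.relu(cos).mean()

    # Anchor benign representations to the frozen model.
    retain = F.mse_loss(retain_reps, retain_reps_orig)

    return reroute + alpha * retain

# Random stand-ins for (batch, seq, hidden) activations.
h, h0 = torch.randn(4, 16, 512, requires_grad=True), torch.randn(4, 16, 512)
r, r0 = torch.randn(4, 16, 512, requires_grad=True), torch.randn(4, 16, 512)
loss = rerouting_loss(h, h0, r, r0)
loss.backward()
```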
Building on our initial findings, we ventured into the realm of AI interpretability and control with the introduction of Representation Engineering (RepE). Drawing inspiration from cognitive neuroscience, we developed techniques that enable researchers to 'read' and 'control' the 'minds' of AI models. This approach represented a monumental advancement in demystifying the inner workings of AI, making it possible to tackle issues such as truthfulness and power-seeking behaviors head-on.
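As a rough illustration of the "reading" and "controlling" idea, the sketch below estimates a concept direction from contrastive activations using a simple mean-difference reader (the published work also explores PCA-based readers) and then nudges activations along it. The arrays are random stand-ins; in practice the activations come from a chosen transformer layer, and the layer choice and steering strength here are assumptions.

```python
import numpy as np

def reading_vector(pos_acts, neg_acts):
    """Estimate a direction separating two behaviors (e.g., honest vs.
    dishonest completions) from paired hidden-state activations.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim), taken from
    the same layer of the model on contrastive prompt pairs.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def read(activation, direction):
    """'Read' how strongly a new activation expresses the concept."""
    return float(activation @ direction)

def control(activation, direction, strength=4.0):
    """'Control' the model by nudging its activation along the concept
    direction before it is passed to the next layer."""
    return activation + strength * direction

# Stand-in activations for demonstration only.
rng = np.random.default_rng(0)
pos, neg = rng.normal(size=(64, 512)), rng.normal(loc=0.2, size=(64, 512))
d = reading_vector(pos, neg)
print(read(rng.normal(size=512), d))
```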
In July 2023, we published the first automated jailbreaking method for large language models (LLMs) and exposed their susceptibility to adversarial attacks. By demonstrating that specific character sequences could bypass sophisticated safeguards, we highlighted a significant vulnerability with urgent implications for widely used AI systems. In its wake, adversarial robustness garnered renewed attention, sparking a gold rush of research dedicated to both jailbreaking and defense.
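To make the idea of an optimized character sequence concrete, here is a deliberately simplified suffix-search loop. The published attack (GCG) ranks token substitutions using gradients through the target model; this sketch swaps that in for random substitution with hill-climbing on a caller-supplied loss, so the function names and toy objective are illustrative only.

```python
import random

def optimize_suffix(loss_fn, vocab, suffix_len=20, iters=500, seed=0):
    """Simplified adversarial-suffix search.

    loss_fn: maps a list of suffix tokens to a scalar, e.g. the negative
             log-probability that the target model begins its reply with
             an affirmative string.
    vocab:   candidate tokens to substitute in.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = loss_fn(suffix)
    for _ in range(iters):
        pos = rng.randrange(suffix_len)        # pick a position to mutate
        candidate = suffix.copy()
        candidate[pos] = rng.choice(vocab)     # try a random substitution
        score = loss_fn(candidate)
        if score < best:                       # keep it only if the loss drops
            suffix, best = candidate, score
    return suffix, best

# Toy objective standing in for a model-based loss: count token mismatches.
target = ["please", "describe", "it"]
toy_vocab = target + ["foo", "bar", "baz"]
best_suffix, best_loss = optimize_suffix(
    lambda s: sum(t != g for t, g in zip(s, target)), toy_vocab, suffix_len=3, iters=200
)
print(best_suffix, best_loss)
```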
Get in touch to discuss your custom research needs.
Keep up to date on all things Gray Swan and AI Security.